{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Author** Andrew Reece \n",
"**Contact** andrew.reece@gmail.com &lt; <a href=\"https://www.linkedin.com/in/andrewreece/\">li</a> &gt; &lt; <a href=\"https://stackoverflow.com/users/2799941/andrew-reece\">so</a> &gt; \n",
"**Date** 23 November 2018"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Motivation** \n",
"Can we train a language model to distinguish between classical works of fiction? "
]
},
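{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from fastai import *\n",
"from fastai.text import *\n",
"from fastai.vision import ClassificationInterpretation\n",
"import pandas as pd\n",
"import pathlib\n",
"# Imported explicitly because later cells use re, np and plt directly.\n",
"import re\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt"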
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"%reload_ext autoreload\n",
"%autoreload 2\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a subset of works of fiction from Project Gutenberg (in .tar.gz format): \n",
"https://drive.google.com/open?id=1Atth50tE_J7FGRBwsr2Mr11DC0i1JBRQ\n",
"\n",
"Untar it in the same directory as this notebook."
]
},
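{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Note: the Drive link opens a web page, so wget may save HTML rather than the archive.\n",
"# If that happens, download the file through a browser before extracting it.\n",
"#! wget \"https://drive.google.com/open?id=1Atth50tE_J7FGRBwsr2Mr11DC0i1JBRQ\"\n",
"#! tar -zxvf gutenberg.tar.gz"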
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"gut_path = pathlib.Path(\"gutenberg/\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fiction from Project Gutenberg\n",
"\n",
"The Gutenberg metadata has been removed from these files, and the first line gives the title, author, and publication year in a systematic pattern. \n",
"In total, this dataset contains 26 works of fiction."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"26"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gutenberg_filenames = list(gut_path.glob(\"*.txt\"))\n",
"len(gutenberg_filenames)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"File names follow the convention `<author_abbreviation>-<title_abbreviation>`.  \n",
"You can isolate the name, without the `.txt` extension, with the following expression:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'twain-huckleberry_finn'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gutenberg_filenames[0].name.split(\".\")[0]"
]
},
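{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Preprocessing** \n",
"The function below takes a simple approach to splitting a text into paragraphs: it splits on runs of whitespace and skips chunks that start with `[` or whose first word is all uppercase (typically bracketed notes and headings).  \n",
"We'll use the paragraphs yielded by `gutenberg_iterator` as inputs to our language model.  \n",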
"\n",
"This code is a modified version of a function originally written by <a href=\"https://web.stanford.edu/~cgpotts/\">Chris Potts</a>, for <a href=\"http://web.stanford.edu/class/linguist278/\">Linguistics 278</a> at Stanford. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def gutenberg_iterator(filename):\n",
"    \"\"\"Yields paragraphs (as defined simply by multiple\n",
"    newlines in a row).\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    filename : str or pathlib.Path\n",
"        Full path to the file.\n",
"\n",
"    Yields\n",
"    ------\n",
"    multiline str\n",
"\n",
"    \"\"\"\n",
"    with open(filename) as f:\n",
"        contents = f.read()\n",
"    # The character class matches any run of 2+ whitespace characters\n",
"    # (or literal asterisks), so blank lines separate paragraphs.\n",
"    for para in re.split(r\"[\\n\\s*]{2,}\", contents):\n",
"        try:\n",
"            # Skip bracketed notes and chunks whose first word is all caps (headings).\n",
"            if para[0] != \"[\" and not para.split()[0].isupper():\n",
"                yield para\n",
"        except IndexError:\n",
"            # Empty chunk: nothing to yield.\n",
"            continue"
]
},
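{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the splitting and filtering rules in `gutenberg_iterator` do in practice, here is a minimal sketch that applies the same regular expression and filter to an in-memory string. The helper name `split_paragraphs` and the sample text are made up for illustration. Under these rules the sketch should keep two of the five chunks. Note that a paragraph whose first word is `I` is also dropped, because a one-letter capital counts as all uppercase."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Hypothetical helper: same regex and filter as gutenberg_iterator,\n",
"# applied to a string instead of a file.\n",
"def split_paragraphs(contents):\n",
"    kept = []\n",
"    for para in re.split(r\"[\\n\\s*]{2,}\", contents):\n",
"        words = para.split()\n",
"        if not words:\n",
"            continue  # empty chunk, e.g. at the very start or end\n",
"        if para[0] != \"[\" and not words[0].isupper():\n",
"            kept.append(para)\n",
"    return kept\n",
"\n",
"# Made-up sample text, not taken from the dataset.\n",
"sample = (\n",
"    \"CHAPTER I\\n\\n\"\n",
"    \"You don't know about me.\\nThat ain't no matter.\\n\\n\"\n",
"    \"[Illustration]\\n\\n\"\n",
"    \"I never seen anybody but lied.\\n\\n\"\n",
"    \"The widow she cried over me.\"\n",
")\n",
"for para in split_paragraphs(sample):\n",
"    print(repr(para))"
]
},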
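{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each text, randomly assign 70% of its paragraphs to training and the rest to validation, labelling every paragraph with the index of its source file. Then concatenate the per-text splits into two master data frames."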
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"train_frac = .7\n",
"\n",
"train_master = pd.DataFrame()\n",
"valid_master = pd.DataFrame()\n",
"\n",
"for i, fn in enumerate(gutenberg_filenames):\n",
"    # Label every paragraph with the integer index of its source file.\n",
"    paras = [[i, para] for para in gutenberg_iterator(fn)]\n",
"    paras_df = pd.DataFrame(paras, columns=[\"label\", \"text\"])\n",
"    # Sample the training rows; everything else goes to validation.\n",
"    train_ix = paras_df.sample(frac=train_frac).index\n",
"    train_mask = paras_df.index.isin(train_ix)\n",
"    train_master = pd.concat([train_master, paras_df.loc[train_mask]])\n",
"    valid_master = pd.concat([valid_master, paras_df.loc[~train_mask]])"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"train_master.to_csv(\"gutenberg_train.csv\", index=False)\n",
"valid_master.to_csv(\"gutenberg_valid.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save a legend of filename-index pairs.  \n",
"We'll use the serial integer indices as the label column in the data frames we pass into `Learner`, and later replace them with readable labels."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"pd.Series([fn.name.split(\".\")[0] for fn in gutenberg_filenames], name=\"text\")\\\n",
" .reset_index()\\\n",
" .to_csv(\"label_key.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load `train` and `valid` data frames."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"train = pd.read_csv(\"gutenberg_train.csv\")\n",
"valid = pd.read_csv(\"gutenberg_valid.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The classes are noticeably imbalanced: in the validation set, paragraph counts range from 57 to 3,797 per text. This makes raw accuracy somewhat misleading.  \n",
"We could use the `f_beta` metric instead, although this notebook doesn't incorporate it yet. For now, we can keep in mind which classes are over- or under-represented, and eyeball the resulting confusion matrix to see whether predictions skew toward the majority classes."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3 3797\n",
"11 3272\n",
"10 2411\n",
"21 2330\n",
"24 2279\n",
"23 2141\n",
"15 1657\n",
"14 1541\n",
"25 1461\n",
"0 1415\n",
"12 1209\n",
"16 1165\n",
"18 972\n",
"4 758\n",
"2 752\n",
"5 523\n",
"6 388\n",
"19 362\n",
"13 338\n",
"7 317\n",
"9 271\n",
"1 270\n",
"20 256\n",
"8 241\n",
"22 67\n",
"17 57\n",
"Name: label, dtype: int64"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"valid.label.value_counts()"
]
},
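{
"cell_type": "markdown",
"metadata": {},
"source": [
"To put any accuracy figure in context, it helps to know what a classifier would score by always predicting the largest class. The cell below is a minimal sketch of that baseline. `majority_baseline` is a hypothetical helper name and the toy labels are made up; in this notebook you would pass `valid.label` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical helper, not part of fastai or pandas.\n",
"def majority_baseline(labels):\n",
"    \"\"\"Accuracy obtained by always predicting the most frequent label.\"\"\"\n",
"    counts = pd.Series(labels).value_counts()\n",
"    return counts.max() / counts.sum()\n",
"\n",
"# Toy labels for illustration only.\n",
"toy_labels = [0, 0, 0, 1, 2]\n",
"print(majority_baseline(toy_labels))"
]
},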
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Build a language model across all Gutenberg texts\n",
"\n",
"This section largely follows `lesson3-imdb` from the fast.ai Fall 2018 DL1 course. \n",
"One difference is that it uses the `from_df()` factory method when creating a `DataBunch`. "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"path = pathlib.Path(\".\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"data_lm = TextDataBunch.from_df(path, train_df=train, valid_df=valid, text_cols=\"text\", label_cols=\"label\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"data_lm.save(\"tmp_lm\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['xxunk', 'xxpad', ',', '\\n', 'the', '1', 'xxbos', 'xxfld', '.', 'and']"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_lm.vocab.itos[:10]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text xxbos xxfld 1 the missouri negro \n",
" dialect ; the extremest form of the backwoods xxunk dialect ; the \n",
" ordinary \" pike county \" dialect ; and four modified varieties of this last . \n",
" the xxunk have not been done in a haphazard fashion , or by xxunk ; \n",
" but xxunk , and with the trustworthy guidance and support of \n",
" personal familiarity with these several forms of speech ."
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_lm.train_ds[0][0]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"bs = 48"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# I'm not sure we need a TextDataBunch first and then a TextLMDataBunch - this is just following the Lesson 3 code.\n",
"data_lm = TextLMDataBunch.load(path, 'tmp_lm', bs = bs)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.3)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n"
]
}
],
"source": [
"learn.lr_find()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"learn.recorder.plot(skip_end=15)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total time: 01:18\n",
"epoch train_loss valid_loss accuracy\n",
"1 5.316948 5.060972 0.212872 (01:18)\n",
"\n"
]
}
],
"source": [
"learn.fit_one_cycle(1, 3e-1, moms=(0.8,0.7))"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"learn.save('fit_head')"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total time: 15:07\n",
"epoch train_loss valid_loss accuracy\n",
"1 4.725402 4.617736 0.260116 (01:30)\n",
"2 4.486015 4.399654 0.278521 (01:30)\n",
"3 4.359496 4.273832 0.287831 (01:30)\n",
"4 4.216336 4.164670 0.294202 (01:31)\n",
"5 4.098336 4.081409 0.302391 (01:30)\n",
"6 3.995981 4.029743 0.308258 (01:30)\n",
"7 3.881822 3.990638 0.312243 (01:30)\n",
"8 3.756205 3.983469 0.313753 (01:30)\n",
"9 3.662281 3.987786 0.314340 (01:30)\n",
"10 3.613561 3.998944 0.314088 (01:30)\n",
"\n"
]
}
],
"source": [
"learn.unfreeze()\n",
"learn.fit_one_cycle(10, 1e-2, moms=(0.8,0.7))"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"learn.save_encoder('fine_tuned_enc')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create text classifier using trained encoder"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table> <col width='90%'> <col width='10%'> <tr>\n",
" <th>text</th>\n",
" <th>label</th>\n",
" </tr>\n",
" <tr>\n",
" <th>xxbos xxfld 1 now he lapsed into suffering again , as the dry argument was resumed . \\n presently he bethought him of a treasure he had and got it out . it was \\n a large black beetle with formidable jaws -- a \" pinchbug , \" he called</th>\n",
" <th>5</th>\n",
" </tr>\n",
" <tr>\n",
" <th>xxbos xxfld 1 circumstances that might swell to half an hour 's relation , \\n and contained multiplied proofs to her who had seen them , had passed \\n xxunk by her who now heard them ; but the two latest occurrences \\n to be mentioned , the two of</th>\n",
" <th>25</th>\n",
" </tr>\n",
" <tr>\n",
" <th>xxbos xxfld 1 \" hang the boy , ca n't i never learn anything ? ai n't he played me tricks \\n enough like that for me to be looking out for him by this time ? but old \\n fools is the biggest fools there is . ca n't</th>\n",
" <th>5</th>\n",
" </tr>\n",
" <tr>\n",
" <th>xxbos xxfld 1 mur . wherefore reioyce ? \\n what conquest brings he home ? \\n what xxunk follow him to rome , \\n to grace in xxunk bonds his chariot xxunk ? \\n you xxunk , you stones , you worse then xxunk things : \\n o you hard</th>\n",
" <th>9</th>\n",
" </tr>\n",
" <tr>\n",
" <th>xxbos xxfld 1 her brother 's return was the first comfort ; he could take best care \\n of his wife ; and the second blessing was the arrival of the apothecary . \\n till he came and had examined the child , their apprehensions were \\n the worse for</th>\n",
" <th>4</th>\n",
" </tr>\n",
"</table>\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"data_clas = TextClasDataBunch.load(path)\n",
"data_clas.show_batch()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"learn = text_classifier_learner(data_clas, drop_mult=0.5)\n",
"learn.load_encoder('fine_tuned_enc')\n",
"learn.freeze()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n"
]
}
],
"source": [
"learn.lr_find()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"learn.recorder.plot()"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total time: 00:49\n",
"epoch train_loss valid_loss accuracy\n",
"1 1.669107 1.360316 0.554678 (00:49)\n",
"\n"
]
}
],
"source": [
"learn.fit_one_cycle(1, 5e-2, moms=(0.8,0.7))"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"learn.save('first_gut')"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total time: 01:02\n",
"epoch train_loss valid_loss accuracy\n",
"1 1.247227 1.038880 0.664926 (01:02)\n",
"\n"
]
}
],
"source": [
"learn.freeze_to(-2)\n",
"learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"learn.save('second_gut')"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total time: 01:35\n",
"epoch train_loss valid_loss accuracy\n",
"1 1.097242 0.934629 0.694545 (01:35)\n",
"\n"
]
}
],
"source": [
"learn.freeze_to(-3)\n",
"learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"learn.save('third_gut')"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total time: 04:22\n",
"epoch train_loss valid_loss accuracy\n",
"1 1.026059 0.902645 0.703074 (02:09)\n",
"2 0.996181 0.905725 0.707372 (02:12)\n",
"\n"
]
}
],
"source": [
"learn.unfreeze()\n",
"learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))"
]
},
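{
"cell_type": "markdown",
"metadata": {},
"source": [
"With just a bit of training, we achieve about 71% classification accuracy on the validation set across 26 classes. For context, based on the validation counts shown earlier, always predicting the largest class would score roughly 12.6% (3,797 of 30,250 paragraphs)."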
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Understanding our model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can look at classification accuracy using the `ClassificationInterpretation` class from the `fastai.vision` module."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"interp = ClassificationInterpretation.from_learner(learn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the index-text legend. This will allow us to replace the numeric classes with more readable text labels."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>twain-huckleberry_finn</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>blake-poems</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>chesterton-brown</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>dickens-ncklb10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>austen-persuasion</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" index text\n",
"0 0 twain-huckleberry_finn\n",
"1 1 blake-poems\n",
"2 2 chesterton-brown\n",
"3 3 dickens-ncklb10\n",
"4 4 austen-persuasion"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"label_key = pd.read_csv(\"label_key.csv\")\n",
"label_key_dict = {x[\"index\"]: x[\"text\"] for x in label_key.to_dict(orient=\"records\")}\n",
"\n",
"label_key.head()"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"confusion = interp.confusion_matrix()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `plot_confusion_matrix()` draws on the current matplotlib figure, which can then be adjusted through the `plt` interface. Here, we use this to replace the tick labels (numeric indices) on the confusion matrix with text labels."
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 840x840 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"interp.plot_confusion_matrix(figsize=(14,14), dpi=60)\n",
"# Swap the numeric tick labels for the readable file-name labels.\n",
"tick_positions = np.arange(len(label_key))\n",
"_ = plt.xticks(tick_positions, label_key.text.values, rotation=90)\n",
"_ = plt.yticks(tick_positions, label_key.text.values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall the class imbalance we saw earlier. Judging by the confusion matrix, accuracy looks reasonably good even for classes with very few samples, and predictions are not heavily skewed toward the majority classes."
]
},
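{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to check this impression numerically is to compute per-class recall from the confusion matrix. The cell below is a minimal sketch, assuming rows hold the actual classes and columns the predicted ones. `per_class_recall` is a hypothetical helper and the 3x3 matrix is made up; in this notebook you would pass `confusion`, assuming it is a NumPy array with that layout."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical helper, not part of fastai.\n",
"def per_class_recall(cm):\n",
"    \"\"\"Fraction of each actual class (row) that was predicted correctly.\"\"\"\n",
"    cm = np.asarray(cm, dtype=float)\n",
"    row_totals = cm.sum(axis=1)\n",
"    # Classes with no examples get a recall of 0 instead of a division error.\n",
"    return np.divide(np.diag(cm), row_totals,\n",
"                     out=np.zeros_like(row_totals), where=row_totals > 0)\n",
"\n",
"# Made-up confusion matrix: rows are actual classes, columns are predictions.\n",
"toy_cm = [[8, 2, 0],\n",
"          [1, 3, 0],\n",
"          [0, 1, 1]]\n",
"recall = per_class_recall(toy_cm)\n",
"print(recall)         # one value per class\n",
"print(recall.mean())  # macro average: every class counts equally"
]
},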
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `most_confused` method gives a clearer sense of where the model goes wrong. Unsurprisingly, the most frequent misses occur when one author has multiple texts and the model predicts the right author but the wrong book. More interesting, perhaps, is that authors with similar writing styles (Blake and Milton, for example) are confused with each other, whereas authors with very different styles (Lewis Carroll vs. Charles Dickens) are rarely confused. "
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Actual</th>\n",
" <th>Predicted</th>\n",
" <th>n_errors</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>dickens-ncklb10</td>\n",
" <td>dickens-olivr11</td>\n",
" <td>389</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>dickens-olivr11</td>\n",
" <td>dickens-ncklb10</td>\n",
" <td>389</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>christie-masac11</td>\n",
" <td>dickens-grexp10</td>\n",
" <td>180</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>christie-secad10</td>\n",
" <td>christie-masac11</td>\n",
" <td>167</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>edgeworth-parents</td>\n",
" <td>dickens-grexp10</td>\n",
" <td>162</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>blake-poems</td>\n",
" <td>milton-paradise</td>\n",
" <td>158</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>christie-masac11</td>\n",
" <td>christie-secad10</td>\n",
" <td>157</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>austen-persuasion</td>\n",
" <td>austen-emma</td>\n",
" <td>129</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>dickens-grexp10</td>\n",
" <td>christie-masac11</td>\n",
" <td>128</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>melville-moby_dick</td>\n",
" <td>dickens-grexp10</td>\n",
" <td>125</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>dickens-ncklb10</td>\n",
" <td>edgeworth-parents</td>\n",
" <td>115</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>dickens-ncklb10</td>\n",
" <td>melville-moby_dick</td>\n",
" <td>113</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>dickens-grexp10</td>\n",
" <td>edgeworth-parents</td>\n",
" <td>112</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>dickens-ncklb10</td>\n",
" <td>dickens-grexp10</td>\n",
" <td>111</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>melville-moby_dick</td>\n",
" <td>edgeworth-parents</td>\n",
" <td>101</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>shakespeare-macbeth</td>\n",
" <td>shakespeare-caesar</td>\n",
" <td>100</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>edgeworth-parents</td>\n",
" <td>melville-moby_dick</td>\n",
" <td>100</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>christie-secad10</td>\n",
" <td>dickens-grexp10</td>\n",
" <td>100</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>whitman-leaves</td>\n",
" <td>milton-paradise</td>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>chesterton-brown</td>\n",
" <td>chesterton-ball</td>\n",
" <td>94</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>chesterton-ball</td>\n",
" <td>melville-moby_dick</td>\n",
" <td>94</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>austen-sense</td>\n",
" <td>austen-emma</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>dickens-grexp10</td>\n",
" <td>christie-secad10</td>\n",
" <td>88</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>dickens-grexp10</td>\n",
" <td>melville-moby_dick</td>\n",
" <td>86</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>twain-huckleberry_finn</td>\n",
" <td>dickens-grexp10</td>\n",
" <td>85</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>chesterton-ball</td>\n",
" <td>dickens-grexp10</td>\n",
" <td>85</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>edgeworth-parents</td>\n",
" <td>christie-secad10</td>\n",
" <td>83</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>carroll-looking-glass</td>\n",
" <td>carroll-alice</td>\n",
" <td>82</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>twain-huckleberry_finn</td>\n",
" <td>christie-secad10</td>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>austen-emma</td>\n",
" <td>austen-sense</td>\n",
" <td>79</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>244</th>\n",
" <td>whitman-leaves</td>\n",
" <td>dickens-ncklb10</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>245</th>\n",
" <td>whitman-leaves</td>\n",
" <td>christie-masac11</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>246</th>\n",
" <td>blake-songs</td>\n",
" <td>dickens-olivr11</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>247</th>\n",
" <td>austen-sense</td>\n",
" <td>twain-huckleberry_finn</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>248</th>\n",
" <td>austen-sense</td>\n",
" <td>chesterton-brown</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>249</th>\n",
" <td>austen-sense</td>\n",
" <td>melville-moby_dick</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>250</th>\n",
" <td>shakespeare-hamlet</td>\n",
" <td>whitman-leaves</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>251</th>\n",
" <td>burgess-busterbrown</td>\n",
" <td>twain-tom_sawyer</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>252</th>\n",
" <td>dickens-olivr11</td>\n",
" <td>austen-sense</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>253</th>\n",
" <td>dickens-olivr11</td>\n",
" <td>carroll-alice</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>254</th>\n",
" <td>austen-emma</td>\n",
" <td>twain-tom_sawyer</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>255</th>\n",
" <td>twain-huckleberry_finn</td>\n",
" <td>chesterton-brown</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>256</th>\n",
" <td>twain-huckleberry_finn</td>\n",
" <td>milton-paradise</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>257</th>\n",
" <td>twain-huckleberry_finn</td>\n",
" <td>whitman-leaves</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>258</th>\n",
" <td>blake-poems</td>\n",
" <td>chesterton-ball</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>259</th>\n",
" <td>austen-persuasion</td>\n",
" <td>dickens-olivr11</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>260</th>\n",
" <td>bryant-stories</td>\n",
" <td>christie-masac11</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>261</th>\n",
" <td>carroll-looking-glass</td>\n",
" <td>twain-huckleberry_finn</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>262</th>\n",
" <td>carroll-looking-glass</td>\n",
" <td>dickens-grexp10</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>263</th>\n",
" <td>edgeworth-parents</td>\n",
" <td>shakespeare-hamlet</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>264</th>\n",
" <td>whitman-leaves</td>\n",
" <td>austen-sense</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>265</th>\n",
" <td>chesterton-ball</td>\n",
" <td>austen-persuasion</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>266</th>\n",
" <td>chesterton-ball</td>\n",
" <td>twain-tom_sawyer</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>267</th>\n",
" <td>chesterton-ball</td>\n",
" <td>shakespeare-caesar</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>268</th>\n",
" <td>dickens-grexp10</td>\n",
" <td>austen-persuasion</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>269</th>\n",
" <td>burgess-busterbrown</td>\n",
" <td>milton-paradise</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>270</th>\n",
" <td>melville-moby_dick</td>\n",
" <td>bryant-stories</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>271</th>\n",
" <td>melville-moby_dick</td>\n",
" <td>shakespeare-macbeth</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>272</th>\n",
" <td>melville-moby_dick</td>\n",
" <td>blake-songs</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>273</th>\n",
" <td>melville-moby_dick</td>\n",
" <td>austen-sense</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>274 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" Actual Predicted n_errors\n",
"0 dickens-ncklb10 dickens-olivr11 389\n",
"1 dickens-olivr11 dickens-ncklb10 389\n",
"2 christie-masac11 dickens-grexp10 180\n",
"3 christie-secad10 christie-masac11 167\n",
"4 edgeworth-parents dickens-grexp10 162\n",
"5 blake-poems milton-paradise 158\n",
"6 christie-masac11 christie-secad10 157\n",
"7 austen-persuasion austen-emma 129\n",
"8 dickens-grexp10 christie-masac11 128\n",
"9 melville-moby_dick dickens-grexp10 125\n",
"10 dickens-ncklb10 edgeworth-parents 115\n",
"11 dickens-ncklb10 melville-moby_dick 113\n",
"12 dickens-grexp10 edgeworth-parents 112\n",
"13 dickens-ncklb10 dickens-grexp10 111\n",
"14 melville-moby_dick edgeworth-parents 101\n",
"15 shakespeare-macbeth shakespeare-caesar 100\n",
"16 edgeworth-parents melville-moby_dick 100\n",
"17 christie-secad10 dickens-grexp10 100\n",
"18 whitman-leaves milton-paradise 95\n",
"19 chesterton-brown chesterton-ball 94\n",
"20 chesterton-ball melville-moby_dick 94\n",
"21 austen-sense austen-emma 90\n",
"22 dickens-grexp10 christie-secad10 88\n",
"23 dickens-grexp10 melville-moby_dick 86\n",
"24 twain-huckleberry_finn dickens-grexp10 85\n",
"25 chesterton-ball dickens-grexp10 85\n",
"26 edgeworth-parents christie-secad10 83\n",
"27 carroll-looking-glass carroll-alice 82\n",
"28 twain-huckleberry_finn christie-secad10 81\n",
"29 austen-emma austen-sense 79\n",
".. ... ... ...\n",
"244 whitman-leaves dickens-ncklb10 4\n",
"245 whitman-leaves christie-masac11 4\n",
"246 blake-songs dickens-olivr11 4\n",
"247 austen-sense twain-huckleberry_finn 4\n",
"248 austen-sense chesterton-brown 4\n",
"249 austen-sense melville-moby_dick 4\n",
"250 shakespeare-hamlet whitman-leaves 4\n",
"251 burgess-busterbrown twain-tom_sawyer 4\n",
"252 dickens-olivr11 austen-sense 4\n",
"253 dickens-olivr11 carroll-alice 4\n",
"254 austen-emma twain-tom_sawyer 4\n",
"255 twain-huckleberry_finn chesterton-brown 3\n",
"256 twain-huckleberry_finn milton-paradise 3\n",
"257 twain-huckleberry_finn whitman-leaves 3\n",
"258 blake-poems chesterton-ball 3\n",
"259 austen-persuasion dickens-olivr11 3\n",
"260 bryant-stories christie-masac11 3\n",
"261 carroll-looking-glass twain-huckleberry_finn 3\n",
"262 carroll-looking-glass dickens-grexp10 3\n",
"263 edgeworth-parents shakespeare-hamlet 3\n",
"264 whitman-leaves austen-sense 3\n",
"265 chesterton-ball austen-persuasion 3\n",
"266 chesterton-ball twain-tom_sawyer 3\n",
"267 chesterton-ball shakespeare-caesar 3\n",
"268 dickens-grexp10 austen-persuasion 3\n",
"269 burgess-busterbrown milton-paradise 3\n",
"270 melville-moby_dick bryant-stories 3\n",
"271 melville-moby_dick shakespeare-macbeth 3\n",
"272 melville-moby_dick blake-songs 3\n",
"273 melville-moby_dick austen-sense 3\n",
"\n",
"[274 rows x 3 columns]"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labeled_classes = [(label_key_dict[x[0]], label_key_dict[x[1]], x[2]) \n",
" for x in interp.most_confused(min_val=2)]\n",
"pd.DataFrame(labeled_classes, columns=[\"Actual\", \"Predicted\", \"n_errors\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 1
}