mattiasostmar/fasttext_jung_sensing_intuition_functions_in_blogs

## fasttext_jung_sensing_intuition_functions_in_blogs
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Author: **Mattias Östmar**\n",
    "\n",
    "Date: **2019-03-15**\n",
    "\n",
    "Contact: **mattiasostmar at gmail dot com**\n",
    "\n",
    "Thanks to Mikael Huss for being a good speaking partner.\n",
    "\n",
    "In this notebook we're going to use the [Python version of fastText](https://pypi.org/project/fasttext/), based on [Facebook's fastText](https://github.com/facebookresearch/fastText) tool, to try to predict which of two opposite [Jungian cognitive functions](https://en.wikipedia.org/wiki/Jungian_cognitive_functions), sensing (s) or intuition (n), characterizes an author, based on the writing style of their blog posts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import requests\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "import fasttext"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Download the annotated dataset as a semicolon-separated CSV file from [https://osf.io/zvw5g/download](https://osf.io/zvw5g/download) (66.1 MB)."
   ]
  },
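  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The download can also be scripted. The cell below is a minimal sketch using `requests`, which is already imported above. The URL and the file name are taken from the surrounding cells; the function name `download_dataset`, the chunk size and the timeout are assumptions made for illustration. The call is left commented out so that the notebook does not fetch the file by accident."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "# URL and file name as given in this notebook.\n",
    "DATA_URL = 'https://osf.io/zvw5g/download'\n",
    "CSV_PATH = 'blog_texts_and_cognitive_function.csv'\n",
    "\n",
    "\n",
    "def download_dataset(url=DATA_URL, path=CSV_PATH, chunk_size=1 << 20):\n",
    "    # Stream the response to disk in 1 MB chunks instead of holding it in memory.\n",
    "    with requests.get(url, stream=True, timeout=60) as response:\n",
    "        response.raise_for_status()  # stop on HTTP errors\n",
    "        with open(path, 'wb') as handle:\n",
    "            for chunk in response.iter_content(chunk_size=chunk_size):\n",
    "                handle.write(chunk)\n",
    "    return path\n",
    "\n",
    "\n",
    "# Uncomment to fetch the file (about 66 MB):\n",
    "# download_dataset()"
   ]
  },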
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>base_function</th>\n",
       "      <th>directed_function</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>❀*a drop of colour*❀ 1/39 next→ home ask past ...</td>\n",
       "      <td>f</td>\n",
       "      <td>fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Neko cool kids can't die home family daveblog ...</td>\n",
       "      <td>t</td>\n",
       "      <td>ti</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Anything... Anything Mass Effect-related Music...</td>\n",
       "      <td>f</td>\n",
       "      <td>fe</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text base_function  \\\n",
       "1  ❀*a drop of colour*❀ 1/39 next→ home ask past ...             f   \n",
       "2  Neko cool kids can't die home family daveblog ...             t   \n",
       "3  Anything... Anything Mass Effect-related Music...             f   \n",
       "\n",
       "  directed_function  \n",
       "1                fi  \n",
       "2                ti  \n",
       "3                fe  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"blog_texts_and_cognitive_function.csv\", sep=\";\", index_col=0)\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 22588 entries, 1 to 25437\n",
      "Data columns (total 3 columns):\n",
      "text                 22588 non-null object\n",
      "base_function        22588 non-null object\n",
      "directed_function    22588 non-null object\n",
      "dtypes: object(3)\n",
      "memory usage: 705.9+ KB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many examples do we have in each class in the original dataset?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "n    9380\n",
       "f    6063\n",
       "t    4502\n",
       "s    2643\n",
       "Name: base_function, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.base_function.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check, crudely, whether blog writers of a certain class write longer or shorter texts on average."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text_len</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>base_function</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>f</th>\n",
       "      <td>476.125869</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>n</th>\n",
       "      <td>489.926113</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>s</th>\n",
       "      <td>488.566448</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>t</th>\n",
       "      <td>508.435853</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 text_len\n",
       "base_function            \n",
       "f              476.125869\n",
       "n              489.926113\n",
       "s              488.566448\n",
       "t              508.435853"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count whitespace-separated tokens per text; the .str accessor keeps the result aligned with df's index\n",
    "df[\"text_len\"] = df.text.str.split().str.len()\n",
    "df.groupby(\"base_function\")[[\"text_len\"]].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Let's try to predict the two cognitive functions sensing and intuition. We need to remove the other labels and prepare the remaining labels to suit fastText's input format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>__label__n</td>\n",
       "      <td>-Only One of Many- Follow on Tumblr Ask me eve...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>__label__s</td>\n",
       "      <td>noon's house home archive message about art → ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>__label__n</td>\n",
       "      <td>№.7 №.7 Contact Archive About Next sorest個人htt...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         label                                               text\n",
       "11  __label__n  -Only One of Many- Follow on Tumblr Ask me eve...\n",
       "16  __label__s  noon's house home archive message about art → ...\n",
       "18  __label__n  №.7 №.7 Contact Archive About Next sorest個人htt..."
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning\n",
    "dataset = df[[\"base_function\", \"text\"]].copy()\n",
    "dataset[\"label\"] = \"__label__\" + dataset.base_function  # fastText expects the __label__ prefix\n",
    "dataset = dataset[[\"label\", \"text\"]]\n",
    "dataset = dataset[dataset.label.isin([\"__label__s\", \"__label__n\"])]  # select only labels s and n\n",
    "dataset.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Balance the two classes by drawing 2643 random samples from each (the size of the smaller class, s), and make sure each remaining class has the expected number of samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "__label__n    2643\n",
       "__label__s    2643\n",
       "Name: label, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped = dataset.groupby(\"label\")\n",
    "sample = grouped.apply(lambda x: x.sample(n=2643)) # We have 2643 samples in class s and 9380 in n\n",
    "sample.label.value_counts()"
   ]
  },
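  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why balance the classes at all? With the original counts, a classifier that always answers n would already look fairly accurate. The arithmetic below is a small sketch using the counts printed by `value_counts()` earlier; the variable names are illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Class counts for the two labels of interest, copied from value_counts() above.\n",
    "counts = {'n': 9380, 's': 2643}\n",
    "\n",
    "total = sum(counts.values())\n",
    "majority_label = max(counts, key=counts.get)\n",
    "majority_baseline = counts[majority_label] / total\n",
    "\n",
    "# Downsampling both classes to the size of the smaller one gives a 50/50 split.\n",
    "per_class = min(counts.values())\n",
    "balanced_total = per_class * len(counts)\n",
    "\n",
    "print('Majority-class accuracy before balancing: {:.3f}'.format(majority_baseline))\n",
    "print('Rows after balancing: {}'.format(balanced_total))"
   ]
  },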
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A visual sanity check of the data to see that we have the correct classes in the sample dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">__label__n</th>\n",
       "      <th>3543</th>\n",
       "      <td>__label__n</td>\n",
       "      <td>Black Market Beauty earthdad : my goal in life...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25030</th>\n",
       "      <td>__label__n</td>\n",
       "      <td>Just Like Yesterday 1.5M ratings 277k ratings ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15351</th>\n",
       "      <td>__label__n</td>\n",
       "      <td>promises to keep and miles to go before I sleep</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       label  \\\n",
       "label                          \n",
       "__label__n 3543   __label__n   \n",
       "           25030  __label__n   \n",
       "           15351  __label__n   \n",
       "\n",
       "                                                               text  \n",
       "label                                                                \n",
       "__label__n 3543   Black Market Beauty earthdad : my goal in life...  \n",
       "           25030  Just Like Yesterday 1.5M ratings 277k ratings ...  \n",
       "           15351  promises to keep and miles to go before I sleep    "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">__label__s</th>\n",
       "      <th>21135</th>\n",
       "      <td>__label__s</td>\n",
       "      <td>It's just gone noon, About me It's just gone n...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21211</th>\n",
       "      <td>__label__s</td>\n",
       "      <td>( ˘ ³˘)♥ ( ˘ ³˘)♥ mbti actual names » gardevor...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9272</th>\n",
       "      <td>__label__s</td>\n",
       "      <td>Strange Things Are Afoot At The Circle K Stran...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       label  \\\n",
       "label                          \n",
       "__label__s 21135  __label__s   \n",
       "           21211  __label__s   \n",
       "           9272   __label__s   \n",
       "\n",
       "                                                               text  \n",
       "label                                                                \n",
       "__label__s 21135  It's just gone noon, About me It's just gone n...  \n",
       "           21211  ( ˘ ³˘)♥ ( ˘ ³˘)♥ mbti actual names » gardevor...  \n",
       "           9272   Strange Things Are Afoot At The Circle K Stran...  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample.tail(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's split the balanced sample into 80 per cent training data and 20 per cent evaluation data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Rows in training data: 4228\n",
      "Rows in test data: 1058\n"
     ]
    }
   ],
   "source": [
    "# See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\n",
    "train, test = train_test_split(sample, test_size=0.2)\n",
    "print(\"Rows in training data: {}\".format(len(train)))\n",
    "print(\"Rows in test data: {}\".format(len(test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we create two separate text files for training and evaluation respectively, with each row containing the label followed by the text, as fastText's input format requires."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "train.to_csv(r'jung_sn_training.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar=\"\", escapechar=\" \")\n",
    "test.to_csv(r'jung_sn_evaluation.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar=\"\", escapechar=\" \")"
   ]
  },
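  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the files written above, each line holds one example: the `__label__` prefixed label followed by the text. As an alternative to the `to_csv` call, the same format can be written explicitly, which avoids relying on the quoting and escape-character settings. This is a sketch with made-up helper names (`to_fasttext_line`, `write_fasttext`) and toy rows, not the code used to produce the files in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "\n",
    "def to_fasttext_line(label, text):\n",
    "    # Collapse all whitespace, including newlines, so each example stays on one line.\n",
    "    return '{} {}'.format(label, ' '.join(text.split()))\n",
    "\n",
    "\n",
    "def write_fasttext(rows, handle):\n",
    "    # rows: iterable of (label, text) pairs\n",
    "    for label, text in rows:\n",
    "        handle.write(to_fasttext_line(label, text) + '\\n')\n",
    "\n",
    "\n",
    "# Toy rows standing in for the real training data.\n",
    "rows = [('__label__n', 'promises to keep\\nand miles to go'),\n",
    "        ('__label__s', 'noon   house home archive')]\n",
    "buffer = io.StringIO()\n",
    "write_fasttext(rows, buffer)\n",
    "print(buffer.getvalue())\n",
    "\n",
    "# For the real data, rows could come from train.itertuples(index=False)."
   ]
  },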
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can train our model with the default settings and no text preprocessing to get an initial setup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "classifier1 = fasttext.supervised(\"jung_sn_training.txt\",\"model_jung_sn_default\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we can evaluate the model using our test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "P@1: 0.5151228733459358\n",
      "R@1: 0.5151228733459358\n",
      "Number of examples: 1058\n"
     ]
    }
   ],
   "source": [
    "result = classifier1.test(\"jung_sn_evaluation.txt\")\n",
    "print('P@1:', result.precision)\n",
    "print('R@1:', result.recall)\n",
    "print('Number of examples:', result.nexamples)"
   ]
  },
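  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How far from chance is a score like this? A rough way to judge it is the standard error of an accuracy estimate under pure guessing, sqrt(p(1-p)/n). The sketch below plugs in the evaluation set size and the P@1 printed above. Treating the 1058 predictions as independent coin flips is a simplifying assumption. A z-score well below 2 means that random guessing would produce a result like this fairly often."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "n_examples = 1058              # size of the evaluation set\n",
    "chance = 0.5                   # expected accuracy of random guessing on balanced classes\n",
    "observed = 0.5151228733459358  # P@1 reported above\n",
    "\n",
    "# Standard error of an accuracy estimate under the null hypothesis of pure chance.\n",
    "standard_error = math.sqrt(chance * (1 - chance) / n_examples)\n",
    "z_score = (observed - chance) / standard_error\n",
    "margin_95 = 1.96 * standard_error\n",
    "\n",
    "print('Standard error: {:.4f}'.format(standard_error))\n",
    "print('z-score: {:.2f}'.format(z_score))\n",
    "print('95% band around chance: {:.3f} to {:.3f}'.format(chance - margin_95, chance + margin_95))"
   ]
  },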
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is 1.5 percentage points better than pure chance (0.5). Let's see if we can improve the model with some crude preprocessing of the texts: replacing non-alphanumeric characters with spaces and making all words lowercase."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">__label__n</th>\n",
       "      <th>3543</th>\n",
       "      <td>__label__n</td>\n",
       "      <td>black market beauty earthdad my goal in life i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25030</th>\n",
       "      <td>__label__n</td>\n",
       "      <td>just like yesterday 1 5m ratings 277k ratings ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15351</th>\n",
       "      <td>__label__n</td>\n",
       "      <td>promises to keep and miles to go before i sleep</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       label  \\\n",
       "label                          \n",
       "__label__n 3543   __label__n   \n",
       "           25030  __label__n   \n",
       "           15351  __label__n   \n",
       "\n",
       "                                                               text  \n",
       "label                                                                \n",
       "__label__n 3543   black market beauty earthdad my goal in life i...  \n",
       "           25030  just like yesterday 1 5m ratings 277k ratings ...  \n",
       "           15351   promises to keep and miles to go before i sleep   "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "processed = sample.copy()\n",
    "processed[\"text\"] = processed.text.str.replace(r\"[\\W ]\", \" \", regex=True) # replace every non-word character (anything but letters, digits and underscore) with a space\n",
    "processed[\"text\"] = processed.text.str.lower() # make all characters lower case\n",
    "processed[\"text\"] = processed.text.str.replace(r' +', ' ', regex=True) # collapse multiple spaces\n",
    "processed[\"text\"] = processed.text.str.replace(r'^ +', '', regex=True) # remove resulting initial spaces\n",
    "\n",
    "processed.head(3)"
   ]
  },
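  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what these replacements do to a single string, here is a small sketch of the same four steps with Python's `re` module. The function name `preprocess` and the second example string are made up for illustration. Note that `\\W` is Unicode-aware for Python 3 strings, so letters such as `ö` and the underscore are kept rather than replaced."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "\n",
    "def preprocess(text):\n",
    "    # Same four steps as the pandas cell, applied to a single string.\n",
    "    text = re.sub(r'[\\W ]', ' ', text)  # non-word characters become spaces\n",
    "    text = text.lower()                 # lower-case everything\n",
    "    text = re.sub(r' +', ' ', text)     # collapse runs of spaces\n",
    "    text = re.sub(r'^ +', '', text)     # drop leading spaces\n",
    "    return text\n",
    "\n",
    "\n",
    "print(preprocess('Black Market Beauty earthdad : my goal'))\n",
    "print(preprocess('Mattias Östmar likes snake_case'))"
   ]
  },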
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then we create training and evaluation data from the processed dataframe and store them to two new files with the prefix \"processed_\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Rows in training data: 4228\n",
      "Rows in test data: 1058\n"
     ]
    }
   ],
   "source": [
    "train, test = train_test_split(processed, test_size=0.2)\n",
    "print(\"Rows in training data: {}\".format(len(train)))\n",
    "print(\"Rows in test data: {}\".format(len(test)))\n",
    "\n",
    "train.to_csv(r'processed_jung_sn_training.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar=\"\", escapechar=\" \")\n",
    "test.to_csv(r'processed_jung_sn_evaluation.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar=\"\", escapechar=\" \")"
   ]
  },
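  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One caveat with the cell above: it calls `train_test_split` a second time without a fixed seed, so the processed files contain a different random split than the unprocessed ones. A possible way around this, sketched below on toy data, is to split the row positions once and reuse them for both versions of the data. The frames `sample_toy` and `processed_toy` are made-up stand-ins for `sample` and `processed`, and `random_state=0` and the stratification are assumptions, not settings used in the original runs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Toy stand-ins for the notebook's sample and processed frames (same rows, same labels).\n",
    "sample_toy = pd.DataFrame({\n",
    "    'label': ['__label__n', '__label__s'] * 10,\n",
    "    'text': ['Raw Text {}!'.format(i) for i in range(20)],\n",
    "})\n",
    "processed_toy = sample_toy.assign(text=sample_toy.text.str.lower())\n",
    "\n",
    "# Split the row positions once, with a fixed seed and stratified on the label ...\n",
    "positions = list(range(len(sample_toy)))\n",
    "train_pos, test_pos = train_test_split(\n",
    "    positions, test_size=0.2, random_state=0, stratify=sample_toy.label)\n",
    "\n",
    "# ... then reuse the same positions for both versions of the data.\n",
    "raw_train, raw_test = sample_toy.iloc[train_pos], sample_toy.iloc[test_pos]\n",
    "proc_train, proc_test = processed_toy.iloc[train_pos], processed_toy.iloc[test_pos]\n",
    "\n",
    "print(len(raw_train), len(raw_test))\n",
    "# No row of the raw evaluation set appears in the processed training set.\n",
    "print(set(proc_train.index) & set(raw_test.index))"
   ]
  },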
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And re-run the training and evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "P@1: 0.497164461247637\n",
      "R@1: 0.497164461247637\n",
      "Number of examples: 1058\n"
     ]
    }
   ],
   "source": [
    "classifier2 = fasttext.supervised(\"processed_jung_sn_training.txt\",\"model_jung_sn_processed\")\n",
    "result = classifier2.test(\"processed_jung_sn_evaluation.txt\")\n",
    "print('P@1:', result.precision)\n",
    "print('R@1:', result.recall)\n",
    "print('Number of examples:', result.nexamples)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With preprocessing the result ends up slightly below pure chance (0.497). This may mean that capital letters and special characters carry some signal that helps distinguish between the labels, although the difference is small. Let's keep the original training data for further training and tuning.\n",
    "\n",
    "What happens if we increase the number of epochs from the default 5 epochs to 10?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "P@1: 0.5085066162570888\n",
      "R@1: 0.5085066162570888\n",
      "Number of examples: 1058\n"
     ]
    }
   ],
   "source": [
    "classifier3 = fasttext.supervised(\"jung_sn_training.txt\", \"model_jung_sn_default_10epochs\", epoch=10)\n",
    "result = classifier3.test(\"jung_sn_evaluation.txt\")\n",
    "print('P@1:', result.precision)\n",
    "print('R@1:', result.recall)\n",
    "print('Number of examples:', result.nexamples)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results deteriorate slightly from 0.515 to 0.508."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens if we increase the learning rate from default 0.05 to 0.1?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "P@1: 0.5103969754253308\n",
      "R@1: 0.5103969754253308\n",
      "Number of examples: 1058\n"
     ]
    }
   ],
   "source": [
    "classifier4 = fasttext.supervised(\"jung_sn_training.txt\", \"model_jung_sn_default_lr0.1\", lr=0.1)\n",
    "result = classifier4.test(\"jung_sn_evaluation.txt\")\n",
    "print('P@1:', result.precision)\n",
    "print('R@1:', result.recall)\n",
    "print('Number of examples:', result.nexamples)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A deterioration from original 0.515 to 0.510."
   ]
  },
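  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last few cells repeat the same train, test and print pattern. A small helper can wrap it. The sketch below is written against the `train_supervised` function of newer releases of the `fasttext` package, which is an assumption about the installed version: this notebook itself uses the older `fasttext.supervised` call. The helper name `train_and_evaluate` is made up, and the usage lines are left commented out because they need the text files written earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fasttext\n",
    "\n",
    "\n",
    "def train_and_evaluate(train_path, eval_path, **params):\n",
    "    # train_supervised is the entry point in newer fasttext releases;\n",
    "    # it takes the place of the fasttext.supervised call used in this notebook.\n",
    "    model = fasttext.train_supervised(input=train_path, **params)\n",
    "    # test() returns the number of examples, precision at 1 and recall at 1.\n",
    "    n_examples, precision, recall = model.test(eval_path)\n",
    "    return model, {'N': n_examples, 'P@1': precision, 'R@1': recall}\n",
    "\n",
    "\n",
    "# Hypothetical usage with the files written earlier in this notebook:\n",
    "# model, scores = train_and_evaluate('jung_sn_training.txt', 'jung_sn_evaluation.txt', epoch=10, lr=0.1)\n",
    "# print(scores)\n",
    "# model.save_model('model_jung_sn_default_10epochs.bin')"
   ]
  },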
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What if we use pre-trained vectors when building the classifier? They can be downloaded from [fasttext.cc](https://fasttext.cc/docs/en/english-vectors.html). For this approach I downloaded the C++ version of fastText from [https://github.com/facebookresearch/fastText](https://github.com/facebookresearch/fastText) and ran it in the terminal. `cd` into the downloaded directory, with the training and evaluation text files located in its parent directory, and the commands below should work for you.\n",
    "\n",
    "Let's train on the preprocessed texts again, using the largest pre-trained vectors with subword information and a learning rate of 0.1, even though that setting did not improve the results above. Note that we have to match the vector size of the pre-trained vector file by increasing the dimensions from the default 100 to 300."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "fastText-0.2.0 $ ./fasttext supervised -input ../processed_jung_sn_training.txt -output ../model_jung_sn_processed_crawl-300d-2M-subword -dim 300 -verbose 3 -lr 0.1 -pretrainedVectors ./crawl-300d-2M-subword/crawl-300d-2M-subword.vec\n",
    "Read 3M words\n",
    "Number of words:  145831\n",
    "Number of labels: 2\n",
    "\n",
    "fastText-0.2.0 $ ./fasttext test ../model_jung_sn_processed_crawl-300d-2M-subword.bin ../processed_jung_sn_evaluation.txt\n",
    "N\t1058\n",
    "P@1\t0.492\n",
    "R@1\t0.492"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Training and evaluation on preprocessed texts actually perform worse than the baseline.\n",
    "\n",
    "Let's also train and evaluate on the original texts without preprocessing."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "fastText-0.2.0 $ ./fasttext supervised -input ../jung_sn_training.txt -output ../model_jung_sn_crawl-300d-2M-subword -dim 300 -verbose 3 -lr 0.1 -pretrainedVectors ./crawl-300d-2M-subword/crawl-300d-2M-subword.vec\n",
    "Read 2M words\n",
    "Number of words:  224430\n",
    "Number of labels: 2\n",
    "Progress: 100.0% words/sec/thread:  907810 lr:  0.000000 loss:  0.678745 ETA:   0h 0m\n",
    "\n",
    "fastText-0.2.0 $ ./fasttext test ../model_jung_sn_crawl-300d-2M-subword.bin ../jung_sn_evaluation.txt\n",
    "N\t1058\n",
    "P@1\t0.486\n",
    "R@1\t0.486\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are still worse than chance. Let's try the model trained on preprocessed text on the evaluation texts that are not preprocessed."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "./fasttext test ../model_jung_sn_processed_crawl-300d-2M-subword.bin ../jung_sn_evaluation.txt\n",
    "N\t1058\n",
    "P@1\t0.545\n",
    "R@1\t0.545"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we get the best results so far, but still only 4.5% better than chance."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}