Using the fastText library to predict the Jungian cognitive functions thinking vs. feeling from annotated blog texts
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Author: **Mattias Östmar**\n",
"\n",
"Date: **2019-03-15**\n",
"\n",
"Contact: **mattiasostmar at gmail dot com**\n",
"\n",
"Thanks to Mikael Huss for being a good speaking partner.\n",
"\n",
"In this notebook we use the [Python version of fastText](https://pypi.org/project/fasttext/), based on [Facebook's fastText](https://github.com/facebookresearch/fastText) tool, to try to predict two opposing [Jungian cognitive functions](https://en.wikipedia.org/wiki/Jungian_cognitive_functions), thinking (t) and feeling (f), from the authors' writing style in blog posts."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import csv\n",
"import requests\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"import fasttext"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Download the annotated dataset as a semicolon-separated CSV file from [https://osf.io/zvw5g/download](https://osf.io/zvw5g/download) (66.1 MB)."
]
},
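{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `requests` import above is otherwise unused; here is a minimal download sketch (an assumption on our part, not part of the original workflow, where the file was fetched manually) that streams the response to disk so the 66 MB file is never held in memory:\n",
"\n",
"```python\n",
"import requests\n",
"\n",
"def download_dataset(url, dest_path, chunk_size=1 << 20):\n",
"    # Stream the response to disk in 1 MB chunks instead of loading it all at once.\n",
"    with requests.get(url, stream=True) as resp:\n",
"        resp.raise_for_status()\n",
"        with open(dest_path, 'wb') as fh:\n",
"            for chunk in resp.iter_content(chunk_size=chunk_size):\n",
"                fh.write(chunk)\n",
"    return dest_path\n",
"\n",
"# download_dataset('https://osf.io/zvw5g/download', 'blog_texts_and_cognitive_function.csv')\n",
"```"
]
},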
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>base_function</th>\n",
" <th>directed_function</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>❀*a drop of colour*❀ 1/39 next→ home ask past ...</td>\n",
" <td>f</td>\n",
" <td>fi</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Neko cool kids can't die home family daveblog ...</td>\n",
" <td>t</td>\n",
" <td>ti</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Anything... Anything Mass Effect-related Music...</td>\n",
" <td>f</td>\n",
" <td>fe</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text base_function \\\n",
"1 ❀*a drop of colour*❀ 1/39 next→ home ask past ... f \n",
"2 Neko cool kids can't die home family daveblog ... t \n",
"3 Anything... Anything Mass Effect-related Music... f \n",
"\n",
" directed_function \n",
"1 fi \n",
"2 ti \n",
"3 fe "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"blog_texts_and_cognitive_function.csv\", sep=\";\", index_col=0)\n",
"df.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 22588 entries, 1 to 25437\n",
"Data columns (total 3 columns):\n",
"text 22588 non-null object\n",
"base_function 22588 non-null object\n",
"directed_function 22588 non-null object\n",
"dtypes: object(3)\n",
"memory usage: 705.9+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many examples do we have in each class in the original dataset?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"n 9380\n",
"f 6063\n",
"t 4502\n",
"s 2643\n",
"Name: base_function, dtype: int64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.base_function.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see, crudely, whether the blog writers in each class write longer or shorter texts on average."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text_len</th>\n",
" </tr>\n",
" <tr>\n",
" <th>base_function</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>f</th>\n",
" <td>476.125869</td>\n",
" </tr>\n",
" <tr>\n",
" <th>n</th>\n",
" <td>489.926113</td>\n",
" </tr>\n",
" <tr>\n",
" <th>s</th>\n",
" <td>488.566448</td>\n",
" </tr>\n",
" <tr>\n",
" <th>t</th>\n",
" <td>508.435853</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text_len\n",
"base_function \n",
"f 476.125869\n",
"n 489.926113\n",
"s 488.566448\n",
"t 508.435853"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count whitespace-separated tokens per row; assigning the result of apply keeps the\n",
"# dataframe's own index, whereas a bare pd.Series(tokens) (fresh 0-based index) would misalign.\n",
"df[\"text_len\"] = df.text.apply(lambda x: len(x.split()))\n",
"df.groupby(\"base_function\").mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Let's try to predict the two cognitive functions thinking and feeling. We need to remove the other labels and format the remaining ones to suit fastText's label convention."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>__label__f</td>\n",
" <td>❀*a drop of colour*❀ 1/39 next→ home ask past ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>__label__t</td>\n",
" <td>Neko cool kids can't die home family daveblog ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>__label__f</td>\n",
" <td>Anything... Anything Mass Effect-related Music...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label text\n",
"1 __label__f ❀*a drop of colour*❀ 1/39 next→ home ask past ...\n",
"2 __label__t Neko cool kids can't die home family daveblog ...\n",
"3 __label__f Anything... Anything Mass Effect-related Music..."
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset = df[[\"base_function\",\"text\"]].copy() # copy to avoid SettingWithCopyWarning\n",
"dataset[\"label\"] = df.base_function.apply(lambda x: \"__label__\" + x)\n",
"dataset.drop(\"base_function\", axis=1, inplace=True)\n",
"dataset = dataset[[\"label\",\"text\"]]\n",
"dataset = dataset[(dataset.label == \"__label__f\") | (dataset.label == \"__label__t\")] # select only labels t and f\n",
"dataset.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that the balanced sample contains the same number of examples for each remaining class."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"__label__f 4500\n",
"__label__t 4500\n",
"Name: label, dtype: int64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grouped = dataset.groupby(\"label\")\n",
"sample = grouped.apply(lambda x: x.sample(n=4500)) # We have 4502 samples in class t and 6063 in f\n",
"sample.label.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A visual sanity check that the sample dataframe contains the right classes."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>text</th>\n",
" </tr>\n",
" <tr>\n",
" <th>label</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">__label__f</th>\n",
" <th>8415</th>\n",
" <td>__label__f</td>\n",
" <td>Land of Serenity and Trust ♑ Land of Serenity ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12995</th>\n",
" <td>__label__f</td>\n",
" <td>Naaah Naaah Mary. 16. Human garbage monster. I...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22195</th>\n",
" <td>__label__f</td>\n",
" <td>faun forever gone faun forever gone “Be kind, ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"label \n",
"__label__f 8415 __label__f \n",
" 12995 __label__f \n",
" 22195 __label__f \n",
"\n",
" text \n",
"label \n",
"__label__f 8415 Land of Serenity and Trust ♑ Land of Serenity ... \n",
" 12995 Naaah Naaah Mary. 16. Human garbage monster. I... \n",
" 22195 faun forever gone faun forever gone “Be kind, ... "
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>text</th>\n",
" </tr>\n",
" <tr>\n",
" <th>label</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">__label__t</th>\n",
" <th>19191</th>\n",
" <td>__label__t</td>\n",
" <td>Well There Ya Go Archive Ask me anything Well ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21651</th>\n",
" <td>__label__t</td>\n",
" <td>Distractions Abound index message archive ELIA...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16338</th>\n",
" <td>__label__t</td>\n",
" <td>bagel aficionado bagel aficionado cecilia/18/m...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"label \n",
"__label__t 19191 __label__t \n",
" 21651 __label__t \n",
" 16338 __label__t \n",
"\n",
" text \n",
"label \n",
"__label__t 19191 Well There Ya Go Archive Ask me anything Well ... \n",
" 21651 Distractions Abound index message archive ELIA... \n",
" 16338 bagel aficionado bagel aficionado cecilia/18/m... "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample.tail(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's split the dataset into two parts: 80 per cent for training and 20 per cent for evaluation."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rows in training data: 7200\n",
"Rows in test data: 1800\n"
]
}
],
"source": [
"# See # https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\n",
"train, test = train_test_split(sample, test_size=0.2)\n",
"print(\"Rows in training data: {}\".format(len(train)))\n",
"print(\"Rows in test data: {}\".format(len(test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we write two separate text files, one for training and one for evaluation, with each row containing the label followed by the text, according to fastText's input format."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"train.to_csv(r'jung_tf_training.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar=\"\", escapechar=\" \")\n",
"test.to_csv(r'jung_tf_evaluation.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar=\"\", escapechar=\" \")"
]
},
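{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before training, it is worth checking that the written files match fastText's expected input format: each line starts with a `__label__` tag followed by the text. A small sketch (the helper name is ours, not from the original notebook):\n",
"\n",
"```python\n",
"def looks_like_fasttext_input(path):\n",
"    # True if the first line begins with a fastText label tag.\n",
"    with open(path) as fh:\n",
"        return fh.readline().startswith('__label__')\n",
"\n",
"# looks_like_fasttext_input('jung_tf_training.txt')  # should be True\n",
"```"
]
},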
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can train our model with the default settings and no text preprocessing to get an initial baseline."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"classifier1 = fasttext.supervised(\"jung_tf_training.txt\",\"model_jung_tf_default\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can evaluate the model using our test data."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"P@1: 0.5216666666666666\n",
"R@1: 0.5216666666666666\n",
"Number of examples: 1800\n"
]
}
],
"source": [
"result = classifier1.test(\"jung_tf_evaluation.txt\")\n",
"print('P@1:', result.precision)\n",
"print('R@1:', result.recall)\n",
"print('Number of examples:', result.nexamples)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results are only slightly better than chance (0.5). Let's see if we can improve the model with some crude preprocessing of the texts: removing non-alphanumeric characters and lowercasing all words."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>text</th>\n",
" </tr>\n",
" <tr>\n",
" <th>label</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">__label__f</th>\n",
" <th>8415</th>\n",
" <td>__label__f</td>\n",
" <td>land of serenity and trust land of serenity an...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12995</th>\n",
" <td>__label__f</td>\n",
" <td>naaah naaah mary 16 human garbage monster i wa...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22195</th>\n",
" <td>__label__f</td>\n",
" <td>faun forever gone faun forever gone be kind fo...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"label \n",
"__label__f 8415 __label__f \n",
" 12995 __label__f \n",
" 22195 __label__f \n",
"\n",
" text \n",
"label \n",
"__label__f 8415 land of serenity and trust land of serenity an... \n",
" 12995 naaah naaah mary 16 human garbage monster i wa... \n",
" 22195 faun forever gone faun forever gone be kind fo... "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"processed = sample.copy()\n",
"processed[\"text\"] = processed.text.str.replace(r\"[\\W ]\",\" \") # replace spaces and all characters outside [a-zA-Z0-9_] with a space\n",
"processed[\"text\"] = processed.text.str.lower() # make all characters lower case\n",
"processed[\"text\"] = processed.text.str.replace(r' +',' ') # Remove multiple spaces\n",
"processed[\"text\"] = processed.text.str.replace(r'^ +','') # Remove resulting initial spaces\n",
"\n",
"processed.head(3)"
]
},
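{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same cleaning, written as a plain-Python function mirroring the pandas calls above (a sketch for illustration; note that `\\W` matches anything outside `[a-zA-Z0-9_]`, so underscores and digits survive the cleaning):\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def crude_clean(text):\n",
"    text = re.sub(r'[\\W ]', ' ', text)  # spaces and non-word chars -> one space each\n",
"    text = text.lower()\n",
"    text = re.sub(r' +', ' ', text)     # collapse runs of spaces\n",
"    return re.sub(r'^ +', '', text)     # drop leading spaces\n",
"```"
]
},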
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then we create training and evaluation data from the processed dataframe and store them in two new files with the prefix \"processed_\"."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rows in training data: 7200\n",
"Rows in test data: 1800\n"
]
}
],
"source": [
"train, test = train_test_split(processed, test_size=0.2)\n",
"print(\"Rows in training data: {}\".format(len(train)))\n",
"print(\"Rows in test data: {}\".format(len(test)))\n",
"\n",
"train.to_csv(r'processed_jung_tf_training.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar=\"\", escapechar=\" \")\n",
"test.to_csv(r'processed_jung_tf_evaluation.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar=\"\", escapechar=\" \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And re-run the training and evaluation."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"P@1: 0.5005555555555555\n",
"R@1: 0.5005555555555555\n",
"Number of examples: 1800\n"
]
}
],
"source": [
"classifier2 = fasttext.supervised(\"processed_jung_tf_training.txt\",\"model_jung_tf_processed\")\n",
"result = classifier2.test(\"processed_jung_tf_evaluation.txt\")\n",
"print('P@1:', result.precision)\n",
"print('R@1:', result.recall)\n",
"print('Number of examples:', result.nexamples)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The preprocessing actually makes the results worse. Apparently capital letters and special characters are features that help distinguish between the labels, so let's keep the original training data for further training and tuning.\n",
"\n",
"What happens if we increase the number of epochs from the default 5 to 10?"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"P@1: 0.5244444444444445\n",
"R@1: 0.5244444444444445\n",
"Number of examples: 1800\n"
]
}
],
"source": [
"classifier3 = fasttext.supervised(\"jung_tf_training.txt\", \"model_jung_tf_default_10epochs\", epoch=10)\n",
"result = classifier3.test(\"jung_tf_evaluation.txt\")\n",
"print('P@1:', result.precision)\n",
"print('R@1:', result.recall)\n",
"print('Number of examples:', result.nexamples)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results improve slightly, from 0.521 to 0.524."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What happens if we also increase the learning rate from default 0.05 to 0.1?"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"P@1: 0.5288888888888889\n",
"R@1: 0.5288888888888889\n",
"Number of examples: 1800\n"
]
}
],
"source": [
"classifier4 = fasttext.supervised(\"jung_tf_training.txt\", \"model_jung_tf_default_lr0.1\", lr=0.1)\n",
"result = classifier4.test(\"jung_tf_evaluation.txt\")\n",
"print('P@1:', result.precision)\n",
"print('R@1:', result.recall)\n",
"print('Number of examples:', result.nexamples)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minuscule improvement, from 0.521 to 0.528."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we use pre-trained vectors when building the classifier? They can be downloaded from [fasttext.cc](https://fasttext.cc/docs/en/english-vectors.html). For this approach I download the C++ version of fastText from [https://github.com/facebookresearch/fastText](https://github.com/facebookresearch/fastText) and run it in the terminal. If you cd into the downloaded fastText directory, with the training and evaluation text files in the parent directory, the commands below should work for you.\n",
"\n",
"Let's train on the preprocessed texts again, using the largest pre-trained vectors with subword information, and keep the learning rate of 0.1 that improved the results slightly. Note that we have to match the vector size of the pre-trained file by increasing the dimensions from the default 100 to 300."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"fastText-0.2.0 $ ./fasttext supervised -input ../processed_jung_tf_training.txt -output ../model_jung_tf_processed_crawl-300d-2M-subword -dim 300 -verbose 3 -lr 0.1 -pretrainedVectors ./crawl-300d-2M-subword/crawl-300d-2M-subword.vec\n",
"Read 3M words\n",
"Number of words: 145831\n",
"Number of labels: 2\n",
"\n",
"fastText-0.2.0 $ ./fasttext test ../model_jung_tf_processed_crawl-300d-2M-subword.bin ../processed_jung_tf_evaluation.txt\n",
"N\t1800\n",
"P@1\t0.502\n",
"R@1\t0.502"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the preprocessed training and evaluation texts the model performs only 0.2 percentage points better than chance.\n",
"\n",
"Let's try with the original texts without preprocessing."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"fastText-0.2.0 $ ./fasttext supervised -input ../jung_tf_training.txt -output ../model_jung_tf_crawl-300d-2M-subword -dim 300 -verbose 3 -lr 0.1 -pretrainedVectors ./crawl-300d-2M-subword/crawl-300d-2M-subword.vec\n",
"Read 3M words\n",
"Number of words: 330731\n",
"Number of labels: 2\n",
"Progress: 100.0% words/sec/thread: 903122 lr: 0.000000 loss: 0.684609 ETA: 0h 0m\n",
"\n",
"./fasttext test ../model_jung_tf_crawl-300d-2M-subword.bin ../jung_tf_evaluation.txt\n",
"N\t1800\n",
"P@1\t0.509\n",
"R@1\t0.509"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That is only 0.9 percentage points better than chance, so it doesn't say very much about the predictability of blog authors' Jungian cognitive functions from their writing style.\n",
"\n",
"However, if we use the model trained on preprocessed texts but evaluate on the texts without preprocessing, we get the best result of all: a precision of 0.543."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"fastText-0.2.0 $ ./fasttext test ../model_jung_tf_processed_crawl-300d-2M-subword.bin ../jung_tf_evaluation.txt\n",
"N\t1800\n",
"P@1\t0.543\n",
"R@1\t0.543"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}