amirziai/amir_ziai_mids_ML@S.ipynb

## amir_ziai_mids_ML@S.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>\n",
    "Amir Ziai<br>\n",
    "amir@ischool.berkeley.edu<br>\n",
    "UC Berkeley MIDS<br>\n",
    "Machine learning at scale course<br>\n",
    "Week 1 assignment<br>\n",
    "September 15, 2015\n",
    "</b>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HW1.0.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Big data</b>: big data is characterized by datasets that are too large and complex to be analyzed by traditional methodologies. Typically three Vs (volume, velocity and variety) are used to capture different aspects of this complexity. For instance volume highlights that \"big data\" problems are of such large sizes that won't fit on a single machine or alternatively are not possible to analyze on a single machine.\n",
    "<br>\n",
    "<br>\n",
    "<b>Example:</b> I recently had to analyze a 500GB flat zipped file (CSV), the task was to first filter for a specific code and then to aggregate the filtered lines. Sequentially processing the file would have taken many days (possibly weeks). I ended up running multiple threads to write out lines that satisfied the criteria in parallel. Since the follow-up aggregation was associative I was able to process each thread's results separately and finally combine the results. The task took less than 24 hours."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HW1.0.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The error between the actual and the predicted values can be decomposed into three parts: <b>bias, variance and irreducible error</b>.<br><br>Bias is the difference between the expected value of the model and the actual values. Variance is the difference between the expected value of the model and the predicted values.<br><br>Generally with more complexity (higher order polynomials in our case) bias shrinks and variance grows. A plot of train/test accuracy against model complexity can help with finding the proper degree for the polynomial. The point where we can get the least test error is desirable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HW1.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "done\n"
     ]
    }
   ],
   "source": [
    "print \"done\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 464\r\n",
      "-rw-------@  1 amir  staff    2138 Aug 15 20:20 enronemail_README.txt\r\n",
      "-rw-------@  1 amir  staff    8367 Sep  3 16:25 hw1_instructions.txt\r\n",
      "-rw-------@  1 amir  staff  203978 Sep 15 13:24 enronemail_1h.txt\r\n",
      "-rwx--x--x@  1 amir  staff    2070 Sep 15 14:46 \u001b[31mpNaiveBayes.sh\u001b[m\u001b[m\r\n",
      "drwxr-xr-x@ 10 amir  staff     340 Sep 15 15:25 \u001b[34m..\u001b[m\u001b[m\r\n",
      "-rwxr-xr-x@  1 amir  staff    1473 Sep 15 17:53 \u001b[31mreducer.py\u001b[m\u001b[m\r\n",
      "-rw-r--r--@  1 amir  staff    2672 Sep 15 17:53 enronemail_1h.txt.output\r\n",
      "drwx------@  9 amir  staff     306 Sep 15 17:53 \u001b[34m.\u001b[m\u001b[m\r\n",
      "-rwxr-xr-x@  1 amir  staff     501 Sep 15 18:13 \u001b[31mmapper.py\u001b[m\u001b[m\r\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "os.chdir('/Users/amir/Dropbox (Personal)/ML@S!/HW1-Questions/')\n",
    "!ls -altr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HW1.2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting mapper.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile mapper.py\n",
    "#!/usr/bin/env python\n",
    "\n",
    "import sys\n",
    "\n",
    "file_name = sys.argv[1]\n",
    "word_input = sys.argv[2]\n",
    "\n",
    "# input comes from STDIN (standard input)\n",
    "for line in open(file_name, 'rb'):\n",
    "    # remove leading and trailing whitespace\n",
    "    line = line.strip()\n",
    "\n",
    "    if word_input in line:\n",
    "        print '%s\\t%s' % (word_input, 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!chmod a+x mapper.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Reducer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting reducer.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile reducer.py\n",
    "#!/usr/bin/env python\n",
    "\n",
    "from operator import itemgetter\n",
    "import sys\n",
    "\n",
    "current_word = None\n",
    "current_count = 0\n",
    "word = None\n",
    "\n",
    "# input comes from STDIN\n",
    "for file_name in sys.argv[1:]:\n",
    "    for line in open(file_name, 'rb'):\n",
    "        # remove leading and trailing whitespace\n",
    "        line = line.strip()\n",
    "\n",
    "        try:\n",
    "            # parse the input we got from mapper.py\n",
    "            word, count = line.split('\\t', 1)\n",
    "\n",
    "            # convert count (currently a string) to int\n",
    "            try:\n",
    "                count = int(count)\n",
    "            except ValueError:\n",
    "                # count was not a number, so silently\n",
    "                # ignore/discard this line\n",
    "                continue\n",
    "\n",
    "            # this IF-switch only works because Hadoop sorts map output\n",
    "            # by key (here: word) before it is passed to the reducer\n",
    "            if current_word == word:\n",
    "                current_count += count\n",
    "            else:\n",
    "                if current_word:\n",
    "                    # write result to STDOUT\n",
    "                    print '%s\\t%s' % (current_word, current_count)\n",
    "                current_count = count\n",
    "                current_word = word\n",
    "        except:\n",
    "            pass\n",
    "\n",
    "# do not forget to output the last word if needed!\n",
    "if current_word == word:\n",
    "    print '%s\\t%s' % (current_word, current_count)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!chmod a+x reducer.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Running pNaiveBayes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!./pNaiveBayes.sh 4 assistance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "assistance\t8\r\n"
     ]
    }
   ],
   "source": [
    "!cat enronemail_1h.txt.output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       8\r\n"
     ]
    }
   ],
   "source": [
    "!grep assistance enronemail_1h.txt | cut -d$'\\t' -f4 | grep assistance | wc -l"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HW1.3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd  # will use for better output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 187,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting mapper.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile mapper.py\n",
    "#!/usr/bin/env python\n",
    "\n",
    "import sys\n",
    "\n",
    "file_name = sys.argv[1]\n",
    "word_input = sys.argv[2]\n",
    "\n",
    "for email in open(file_name, 'rb'):\n",
    "    email = email.strip()\n",
    "    items = email.split('\\t') \n",
    "    \n",
    "    # since we want the output of the reducer to include\n",
    "    # prediction for each email we need to map each email\n",
    "    uid = items[0]  # email unique id (first column)\n",
    "    spam = items[1]\n",
    "    count_input = email.count(word_input) # number of occurence of word\n",
    "    count_all = len(email.split()) # total number of words in email\n",
    "    \n",
    "    print '%s\\t%s\\t%s\\t%s' % (uid, spam, count_input, count_all)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Reducer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 188,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting reducer.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile reducer.py\n",
    "#!/usr/bin/env python\n",
    "\n",
    "import sys\n",
    "\n",
    "# universal counts\n",
    "spams = 0\n",
    "total = 0\n",
    "words_spam = 0\n",
    "words_ham = 0\n",
    "input_spam = 0\n",
    "input_ham = 0\n",
    "\n",
    "output = []\n",
    "\n",
    "for file_name in sys.argv[1:]:\n",
    "    for email in open(file_name, 'rb'):\n",
    "        email = email.strip()\n",
    "\n",
    "        try:\n",
    "            # parse the input we got from mapper.py\n",
    "            uid, spam, count_input, count_all = email.split('\\t')\n",
    "            \n",
    "            spam = int(spam)\n",
    "            count_input = int(count_input)\n",
    "            count_all = int(count_all)\n",
    "            spams += spam\n",
    "            total += 1\n",
    "            \n",
    "            if spam == 1:\n",
    "                words_spam += count_all\n",
    "                input_spam += count_input\n",
    "            else:\n",
    "                words_ham += count_all\n",
    "                input_ham += count_input\n",
    "                \n",
    "            output.append({'id': uid, 'spam': spam, 'count': count_input})\n",
    "            \n",
    "        except Exception, e:\n",
    "            print e\n",
    "            pass\n",
    "\n",
    "# probabilities\n",
    "# smoothing\n",
    "# vocabulary has only a single token so the denominator is simply +1\n",
    "prior_spam = spams / float(total)\n",
    "prior_ham = 1 - prior_spam\n",
    "probability_input_spam = (1 + input_spam) / float((words_spam + 1))\n",
    "probability_input_ham = (1 + input_ham) / float((words_ham + 1))\n",
    "        \n",
    "# create outputs\n",
    "for out in output: \n",
    "    # manageable to just multiply (no overflow)\n",
    "    # will need to switch to log and addition for next parts\n",
    "    # raising the probability to the power of number of occurences\n",
    "    # so for count = 0 there's simply no effect (multiplied by 1)\n",
    "    probability_spam = prior_spam * (probability_input_spam) ** out['count']\n",
    "    probability_ham = prior_ham * (probability_input_ham) ** out['count']\n",
    "    prediction = 1 if probability_spam > probability_ham else 0\n",
    "    \n",
    "    print '%s\\t%s\\t%s' % (out['id'], out['spam'], prediction)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 189,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!./pNaiveBayes.sh 4 assistance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 190,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.read_table('enronemail_1h.txt.output', header=None, names=['ID', 'TRUTH', 'CLASS'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 191,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>TRUTH</th>\n",
       "      <th>CLASS</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>0017.2004-08-01.BG</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>0017.2004-08-02.BG</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>0018.1999-12-14.kaminski</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>0018.2001-07-13.SA_and_HP</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>0018.2003-12-18.GP</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           ID  TRUTH  CLASS\n",
       "95         0017.2004-08-01.BG      1      0\n",
       "96         0017.2004-08-02.BG      1      0\n",
       "97   0018.1999-12-14.kaminski      0      0\n",
       "98  0018.2001-07-13.SA_and_HP      1      1\n",
       "99         0018.2003-12-18.GP      1      1"
      ]
     },
     "execution_count": 191,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 192,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prediction accuracy percentage: 60.00\n"
     ]
    }
   ],
   "source": [
    "print 'Prediction accuracy percentage: %.2f' % (100 * len(df[df.TRUTH == df.CLASS]) / float(len(df)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HW1.4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting mapper.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile mapper.py\n",
    "#!/usr/bin/env python\n",
    "\n",
    "import sys\n",
    "\n",
    "file_name = sys.argv[1]\n",
    "word_inputs = sys.argv[2:]\n",
    "\n",
    "# input comes from a file that split has created\n",
    "for email in open(file_name, 'rb'):\n",
    "    email = email.strip()\n",
    "    items = email.split('\\t') \n",
    "    \n",
    "    # outputs\n",
    "    uid = items[0]  # email unique id (first column)\n",
    "    spam = items[1]\n",
    "    count_all = len(email.split())\n",
    "\n",
    "    for word in word_inputs:\n",
    "        count_input = email.count(word)\n",
    "        print '%s\\t%s\\t%s\\t%s\\t%s' % \\\n",
    "        (uid, spam, count_all, word, count_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mapper output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0001.1999-12-10.farmer\t0\t7\tassistance\t0\r\n",
      "0001.1999-12-10.farmer\t0\t7\tvalium\t0\r\n",
      "0001.1999-12-10.farmer\t0\t7\tenlargementWithATypo\t0\r\n",
      "0001.1999-12-10.kaminski\t0\t6\tassistance\t0\r\n",
      "0001.1999-12-10.kaminski\t0\t6\tvalium\t0\r\n",
      "0001.1999-12-10.kaminski\t0\t6\tenlargementWithATypo\t0\r\n",
      "0001.2000-01-17.beck\t0\t526\tassistance\t0\r\n",
      "0001.2000-01-17.beck\t0\t526\tvalium\t0\r\n",
      "0001.2000-01-17.beck\t0\t526\tenlargementWithATypo\t0\r\n",
      "0001.2000-06-06.lokay\t0\t519\tassistance\t0\r\n",
      "0001.2000-06-06.lokay\t0\t519\tvalium\t0\r\n",
      "0001.2000-06-06.lokay\t0\t519\tenlargementWithATypo\t0\r\n",
      "0001.2001-02-07.kitchen\t0\t44\tassistance\t0\r\n",
      "0001.2001-02-07.kitchen\t0\t44\tvalium\t0\r\n",
      "0001.2001-02-07.kitchen\t0\t44\tenlargementWithATypo\t0\r\n",
      "0001.2001-04-02.williams\t0\t196\tassistance\t0\r\n",
      "0001.2001-04-02.williams\t0\t196\tvalium\t0\r\n",
      "0001.2001-04-02.williams\t0\t196\tenlargementWithATypo\t0\r\n",
      "0002.1999-12-13.farmer\t0\t589\tassistance\t0\r\n",
      "0002.1999-12-13.farmer\t0\t589\tvalium\t0\r\n",
      "0002.1999-12-13.farmer\t0\t589\tenlargementWithATypo\t0\r\n",
      "0002.2001-02-07.kitchen\t0\t66\tassistance\t0\r\n",
      "0002.2001-02-07.kitchen\t0\t66\tvalium\t0\r\n",
      "0002.2001-02-07.kitchen\t0\t66\tenlargementWithATypo\t0\r\n",
      "0002.2001-05-25.SA_and_HP\t1\t88\tassistance\t0\r\n",
      "0002.2001-05-25.SA_and_HP\t1\t88\tvalium\t0\r\n",
      "0002.2001-05-25.SA_and_HP\t1\t88\tenlargementWithATypo\t0\r\n",
      "0002.2003-12-18.GP\t1\t176\tassistance\t0\r\n",
      "0002.2003-12-18.GP\t1\t176\tvalium\t0\r\n",
      "0002.2003-12-18.GP\t1\t176\tenlargementWithATypo\t0\r\n",
      "0002.2004-08-01.BG\t1\t135\tassistance\t1\r\n",
      "0002.2004-08-01.BG\t1\t135\tvalium\t0\r\n",
      "0002.2004-08-01.BG\t1\t135\tenlargementWithATypo\t0\r\n",
      "0003.1999-12-10.kaminski\t0\t69\tassistance\t0\r\n",
      "0003.1999-12-10.kaminski\t0\t69\tvalium\t0\r\n",
      "0003.1999-12-10.kaminski\t0\t69\tenlargementWithATypo\t0\r\n",
      "0003.1999-12-14.farmer\t0\t12\tassistance\t0\r\n",
      "0003.1999-12-14.farmer\t0\t12\tvalium\t0\r\n",
      "0003.1999-12-14.farmer\t0\t12\tenlargementWithATypo\t0\r\n",
      "0003.2000-01-17.beck\t0\t199\tassistance\t0\r\n",
      "0003.2000-01-17.beck\t0\t199\tvalium\t0\r\n",
      "0003.2000-01-17.beck\t0\t199\tenlargementWithATypo\t0\r\n",
      "0003.2001-02-08.kitchen\t0\t182\tassistance\t0\r\n",
      "0003.2001-02-08.kitchen\t0\t182\tvalium\t0\r\n",
      "0003.2001-02-08.kitchen\t0\t182\tenlargementWithATypo\t0\r\n",
      "0003.2003-12-18.GP\t1\t118\tassistance\t0\r\n",
      "0003.2003-12-18.GP\t1\t118\tvalium\t0\r\n",
      "0003.2003-12-18.GP\t1\t118\tenlargementWithATypo\t0\r\n",
      "0003.2004-08-01.BG\t1\t106\tassistance\t0\r\n",
      "0003.2004-08-01.BG\t1\t106\tvalium\t0\r\n",
      "0003.2004-08-01.BG\t1\t106\tenlargementWithATypo\t0\r\n",
      "0004.1999-12-10.kaminski\t0\t156\tassistance\t1\r\n",
      "0004.1999-12-10.kaminski\t0\t156\tvalium\t0\r\n",
      "0004.1999-12-10.kaminski\t0\t156\tenlargementWithATypo\t0\r\n",
      "0004.1999-12-14.farmer\t0\t148\tassistance\t0\r\n",
      "0004.1999-12-14.farmer\t0\t148\tvalium\t0\r\n",
      "0004.1999-12-14.farmer\t0\t148\tenlargementWithATypo\t0\r\n",
      "0004.2001-04-02.williams\t0\t95\tassistance\t0\r\n",
      "0004.2001-04-02.williams\t0\t95\tvalium\t0\r\n",
      "0004.2001-04-02.williams\t0\t95\tenlargementWithATypo\t0\r\n",
      "0004.2001-06-12.SA_and_HP\t1\t136\tassistance\t0\r\n",
      "0004.2001-06-12.SA_and_HP\t1\t136\tvalium\t0\r\n",
      "0004.2001-06-12.SA_and_HP\t1\t136\tenlargementWithATypo\t0\r\n",
      "0004.2004-08-01.BG\t1\t105\tassistance\t0\r\n",
      "0004.2004-08-01.BG\t1\t105\tvalium\t0\r\n",
      "0004.2004-08-01.BG\t1\t105\tenlargementWithATypo\t0\r\n",
      "0005.1999-12-12.kaminski\t0\t100\tassistance\t1\r\n",
      "0005.1999-12-12.kaminski\t0\t100\tvalium\t0\r\n",
      "0005.1999-12-12.kaminski\t0\t100\tenlargementWithATypo\t0\r\n",
      "0005.1999-12-14.farmer\t0\t149\tassistance\t0\r\n",
      "0005.1999-12-14.farmer\t0\t149\tvalium\t0\r\n",
      "0005.1999-12-14.farmer\t0\t149\tenlargementWithATypo\t0\r\n",
      "0005.2000-06-06.lokay\t0\t62\tassistance\t0\r\n",
      "0005.2000-06-06.lokay\t0\t62\tvalium\t0\r\n",
      "0005.2000-06-06.lokay\t0\t62\tenlargementWithATypo\t0\r\n",
      "0005.2001-02-08.kitchen\t0\t114\tassistance\t0\r\n",
      "0005.2001-02-08.kitchen\t0\t114\tvalium\t0\r\n",
      "0005.2001-02-08.kitchen\t0\t114\tenlargementWithATypo\t0\r\n",
      "0005.2001-06-23.SA_and_HP\t1\t27\tassistance\t0\r\n",
      "0005.2001-06-23.SA_and_HP\t1\t27\tvalium\t0\r\n",
      "0005.2001-06-23.SA_and_HP\t1\t27\tenlargementWithATypo\t0\r\n",
      "0005.2003-12-18.GP\t1\t1019\tassistance\t0\r\n",
      "0005.2003-12-18.GP\t1\t1019\tvalium\t0\r\n",
      "0005.2003-12-18.GP\t1\t1019\tenlargementWithATypo\t0\r\n",
      "0006.1999-12-13.kaminski\t0\t75\tassistance\t0\r\n",
      "0006.1999-12-13.kaminski\t0\t75\tvalium\t0\r\n",
      "0006.1999-12-13.kaminski\t0\t75\tenlargementWithATypo\t0\r\n",
      "0006.2001-02-08.kitchen\t0\t1432\tassistance\t0\r\n",
      "0006.2001-02-08.kitchen\t0\t1432\tvalium\t0\r\n",
      "0006.2001-02-08.kitchen\t0\t1432\tenlargementWithATypo\t0\r\n",
      "0006.2001-04-03.williams\t0\t48\tassistance\t0\r\n",
      "0006.2001-04-03.williams\t0\t48\tvalium\t0\r\n",
      "0006.2001-04-03.williams\t0\t48\tenlargementWithATypo\t0\r\n",
      "0006.2001-06-25.SA_and_HP\t1\t55\tassistance\t0\r\n",
      "0006.2001-06-25.SA_and_HP\t1\t55\tvalium\t0\r\n",
      "0006.2001-06-25.SA_and_HP\t1\t55\tenlargementWithATypo\t0\r\n",
      "0006.2003-12-18.GP\t1\t144\tassistance\t0\r\n",
      "0006.2003-12-18.GP\t1\t144\tvalium\t0\r\n",
      "0006.2003-12-18.GP\t1\t144\tenlargementWithATypo\t0\r\n",
      "0006.2004-08-01.BG\t1\t150\tassistance\t0\r\n",
      "0006.2004-08-01.BG\t1\t150\tvalium\t0\r\n",
      "0006.2004-08-01.BG\t1\t150\tenlargementWithATypo\t0\r\n",
      "0007.1999-12-13.kaminski\t0\t230\tassistance\t0\r\n",
      "0007.1999-12-13.kaminski\t0\t230\tvalium\t0\r\n",
      "0007.1999-12-13.kaminski\t0\t230\tenlargementWithATypo\t0\r\n",
      "0007.1999-12-14.farmer\t0\t102\tassistance\t0\r\n",
      "0007.1999-12-14.farmer\t0\t102\tvalium\t0\r\n",
      "0007.1999-12-14.farmer\t0\t102\tenlargementWithATypo\t0\r\n",
      "0007.2000-01-17.beck\t0\t425\tassistance\t0\r\n",
      "0007.2000-01-17.beck\t0\t425\tvalium\t0\r\n",
      "0007.2000-01-17.beck\t0\t425\tenlargementWithATypo\t0\r\n",
      "0007.2001-02-09.kitchen\t0\t249\tassistance\t0\r\n",
      "0007.2001-02-09.kitchen\t0\t249\tvalium\t0\r\n",
      "0007.2001-02-09.kitchen\t0\t249\tenlargementWithATypo\t0\r\n",
      "0007.2003-12-18.GP\t1\t167\tassistance\t0\r\n",
      "0007.2003-12-18.GP\t1\t167\tvalium\t0\r\n",
      "0007.2003-12-18.GP\t1\t167\tenlargementWithATypo\t0\r\n",
      "0007.2004-08-01.BG\t1\t195\tassistance\t0\r\n",
      "0007.2004-08-01.BG\t1\t195\tvalium\t0\r\n",
      "0007.2004-08-01.BG\t1\t195\tenlargementWithATypo\t0\r\n",
      "0008.2001-02-09.kitchen\t0\t640\tassistance\t0\r\n",
      "0008.2001-02-09.kitchen\t0\t640\tvalium\t0\r\n",
      "0008.2001-02-09.kitchen\t0\t640\tenlargementWithATypo\t0\r\n",
      "0008.2001-06-12.SA_and_HP\t1\t136\tassistance\t0\r\n",
      "0008.2001-06-12.SA_and_HP\t1\t136\tvalium\t0\r\n",
      "0008.2001-06-12.SA_and_HP\t1\t136\tenlargementWithATypo\t0\r\n",
      "0008.2001-06-25.SA_and_HP\t1\t624\tassistance\t0\r\n",
      "0008.2001-06-25.SA_and_HP\t1\t624\tvalium\t0\r\n",
      "0008.2001-06-25.SA_and_HP\t1\t624\tenlargementWithATypo\t0\r\n",
      "0008.2003-12-18.GP\t1\t150\tassistance\t0\r\n",
      "0008.2003-12-18.GP\t1\t150\tvalium\t0\r\n",
      "0008.2003-12-18.GP\t1\t150\tenlargementWithATypo\t0\r\n",
      "0008.2004-08-01.BG\t1\t857\tassistance\t0\r\n",
      "0008.2004-08-01.BG\t1\t857\tvalium\t0\r\n",
      "0008.2004-08-01.BG\t1\t857\tenlargementWithATypo\t0\r\n",
      "0009.1999-12-13.kaminski\t0\t913\tassistance\t0\r\n",
      "0009.1999-12-13.kaminski\t0\t913\tvalium\t0\r\n",
      "0009.1999-12-13.kaminski\t0\t913\tenlargementWithATypo\t0\r\n",
      "0009.1999-12-14.farmer\t0\t67\tassistance\t0\r\n",
      "0009.1999-12-14.farmer\t0\t67\tvalium\t0\r\n",
      "0009.1999-12-14.farmer\t0\t67\tenlargementWithATypo\t0\r\n",
      "0009.2000-06-07.lokay\t0\t395\tassistance\t0\r\n",
      "0009.2000-06-07.lokay\t0\t395\tvalium\t0\r\n",
      "0009.2000-06-07.lokay\t0\t395\tenlargementWithATypo\t0\r\n",
      "0009.2001-02-09.kitchen\t0\t853\tassistance\t0\r\n",
      "0009.2001-02-09.kitchen\t0\t853\tvalium\t0\r\n",
      "0009.2001-02-09.kitchen\t0\t853\tenlargementWithATypo\t0\r\n",
      "0009.2001-06-26.SA_and_HP\t1\t200\tassistance\t0\r\n",
      "0009.2001-06-26.SA_and_HP\t1\t200\tvalium\t0\r\n",
      "0009.2001-06-26.SA_and_HP\t1\t200\tenlargementWithATypo\t0\r\n",
      "0009.2003-12-18.GP\t1\t97\tassistance\t0\r\n",
      "0009.2003-12-18.GP\t1\t97\tvalium\t1\r\n",
      "0009.2003-12-18.GP\t1\t97\tenlargementWithATypo\t0\r\n",
      "0010.1999-12-14.farmer\t0\t182\tassistance\t0\r\n",
      "0010.1999-12-14.farmer\t0\t182\tvalium\t0\r\n",
      "0010.1999-12-14.farmer\t0\t182\tenlargementWithATypo\t0\r\n",
      "0010.1999-12-14.kaminski\t0\t32\tassistance\t0\r\n",
      "0010.1999-12-14.kaminski\t0\t32\tvalium\t0\r\n",
      "0010.1999-12-14.kaminski\t0\t32\tenlargementWithATypo\t0\r\n",
      "0010.2001-02-09.kitchen\t0\t452\tassistance\t0\r\n",
      "0010.2001-02-09.kitchen\t0\t452\tvalium\t0\r\n",
      "0010.2001-02-09.kitchen\t0\t452\tenlargementWithATypo\t0\r\n",
      "0010.2001-06-28.SA_and_HP\t1\t519\tassistance\t1\r\n",
      "0010.2001-06-28.SA_and_HP\t1\t519\tvalium\t0\r\n",
      "0010.2001-06-28.SA_and_HP\t1\t519\tenlargementWithATypo\t0\r\n",
      "0010.2003-12-18.GP\t1\t8\tassistance\t0\r\n",
      "0010.2003-12-18.GP\t1\t8\tvalium\t0\r\n",
      "0010.2003-12-18.GP\t1\t8\tenlargementWithATypo\t0\r\n",
      "0010.2004-08-01.BG\t1\t311\tassistance\t0\r\n",
      "0010.2004-08-01.BG\t1\t311\tvalium\t0\r\n",
      "0010.2004-08-01.BG\t1\t311\tenlargementWithATypo\t0\r\n",
      "0011.1999-12-14.farmer\t0\t295\tassistance\t0\r\n",
      "0011.1999-12-14.farmer\t0\t295\tvalium\t0\r\n",
      "0011.1999-12-14.farmer\t0\t295\tenlargementWithATypo\t0\r\n",
      "0011.2001-06-28.SA_and_HP\t1\t518\tassistance\t1\r\n",
      "0011.2001-06-28.SA_and_HP\t1\t518\tvalium\t0\r\n",
      "0011.2001-06-28.SA_and_HP\t1\t518\tenlargementWithATypo\t0\r\n",
      "0011.2001-06-29.SA_and_HP\t1\t2503\tassistance\t0\r\n",
      "0011.2001-06-29.SA_and_HP\t1\t2503\tvalium\t0\r\n",
      "0011.2001-06-29.SA_and_HP\t1\t2503\tenlargementWithATypo\t0\r\n",
      "0011.2003-12-18.GP\t1\t70\tassistance\t0\r\n",
      "0011.2003-12-18.GP\t1\t70\tvalium\t0\r\n",
      "0011.2003-12-18.GP\t1\t70\tenlargementWithATypo\t0\r\n",
      "0011.2004-08-01.BG\t1\t98\tassistance\t0\r\n",
      "0011.2004-08-01.BG\t1\t98\tvalium\t0\r\n",
      "0011.2004-08-01.BG\t1\t98\tenlargementWithATypo\t0\r\n",
      "0012.1999-12-14.farmer\t0\t493\tassistance\t0\r\n",
      "0012.1999-12-14.farmer\t0\t493\tvalium\t0\r\n",
      "0012.1999-12-14.farmer\t0\t493\tenlargementWithATypo\t0\r\n",
      "0012.1999-12-14.kaminski\t0\t137\tassistance\t0\r\n",
      "0012.1999-12-14.kaminski\t0\t137\tvalium\t0\r\n",
      "0012.1999-12-14.kaminski\t0\t137\tenlargementWithATypo\t0\r\n",
      "0012.2000-01-17.beck\t0\t422\tassistance\t0\r\n",
      "0012.2000-01-17.beck\t0\t422\tvalium\t0\r\n",
      "0012.2000-01-17.beck\t0\t422\tenlargementWithATypo\t0\r\n",
      "0012.2000-06-08.lokay\t0\t142\tassistance\t0\r\n",
      "0012.2000-06-08.lokay\t0\t142\tvalium\t0\r\n",
      "0012.2000-06-08.lokay\t0\t142\tenlargementWithATypo\t0\r\n",
      "0012.2001-02-09.kitchen\t0\t74\tassistance\t0\r\n",
      "0012.2001-02-09.kitchen\t0\t74\tvalium\t0\r\n",
      "0012.2001-02-09.kitchen\t0\t74\tenlargementWithATypo\t0\r\n",
      "0012.2003-12-19.GP\t1\t22\tassistance\t0\r\n",
      "0012.2003-12-19.GP\t1\t22\tvalium\t0\r\n",
      "0012.2003-12-19.GP\t1\t22\tenlargementWithATypo\t0\r\n",
      "0013.1999-12-14.farmer\t0\t285\tassistance\t0\r\n",
      "0013.1999-12-14.farmer\t0\t285\tvalium\t0\r\n",
      "0013.1999-12-14.farmer\t0\t285\tenlargementWithATypo\t0\r\n",
      "0013.1999-12-14.kaminski\t0\t192\tassistance\t0\r\n",
      "0013.1999-12-14.kaminski\t0\t192\tvalium\t0\r\n",
      "0013.1999-12-14.kaminski\t0\t192\tenlargementWithATypo\t0\r\n",
      "0013.2001-04-03.williams\t0\t95\tassistance\t0\r\n",
      "0013.2001-04-03.williams\t0\t95\tvalium\t0\r\n",
      "0013.2001-04-03.williams\t0\t95\tenlargementWithATypo\t0\r\n",
      "0013.2001-06-30.SA_and_HP\t1\t4518\tassistance\t0\r\n",
      "0013.2001-06-30.SA_and_HP\t1\t4518\tvalium\t0\r\n",
      "0013.2001-06-30.SA_and_HP\t1\t4518\tenlargementWithATypo\t0\r\n",
      "0013.2004-08-01.BG\t1\t213\tassistance\t1\r\n",
      "0013.2004-08-01.BG\t1\t213\tvalium\t0\r\n",
      "0013.2004-08-01.BG\t1\t213\tenlargementWithATypo\t0\r\n",
      "0014.1999-12-14.kaminski\t0\t270\tassistance\t0\r\n",
      "0014.1999-12-14.kaminski\t0\t270\tvalium\t0\r\n",
      "0014.1999-12-14.kaminski\t0\t270\tenlargementWithATypo\t0\r\n",
      "0014.1999-12-15.farmer\t0\t149\tassistance\t0\r\n",
      "0014.1999-12-15.farmer\t0\t149\tvalium\t0\r\n",
      "0014.1999-12-15.farmer\t0\t149\tenlargementWithATypo\t0\r\n",
      "0014.2001-02-12.kitchen\t0\t181\tassistance\t0\r\n",
      "0014.2001-02-12.kitchen\t0\t181\tvalium\t0\r\n",
      "0014.2001-02-12.kitchen\t0\t181\tenlargementWithATypo\t0\r\n",
      "0014.2001-07-04.SA_and_HP\t1\t578\tassistance\t0\r\n",
      "0014.2001-07-04.SA_and_HP\t1\t578\tvalium\t0\r\n",
      "0014.2001-07-04.SA_and_HP\t1\t578\tenlargementWithATypo\t0\r\n",
      "0014.2003-12-19.GP\t1\t24\tassistance\t0\r\n",
      "0014.2003-12-19.GP\t1\t24\tvalium\t0\r\n",
      "0014.2003-12-19.GP\t1\t24\tenlargementWithATypo\t0\r\n",
      "0014.2004-08-01.BG\t1\t105\tassistance\t0\r\n",
      "0014.2004-08-01.BG\t1\t105\tvalium\t0\r\n",
      "0014.2004-08-01.BG\t1\t105\tenlargementWithATypo\t0\r\n",
      "0015.1999-12-14.kaminski\t0\t80\tassistance\t0\r\n",
      "0015.1999-12-14.kaminski\t0\t80\tvalium\t0\r\n",
      "0015.1999-12-14.kaminski\t0\t80\tenlargementWithATypo\t0\r\n",
      "0015.1999-12-15.farmer\t0\t94\tassistance\t0\r\n",
      "0015.1999-12-15.farmer\t0\t94\tvalium\t0\r\n",
      "0015.1999-12-15.farmer\t0\t94\tenlargementWithATypo\t0\r\n",
      "0015.2000-06-09.lokay\t0\t21\tassistance\t0\r\n",
      "0015.2000-06-09.lokay\t0\t21\tvalium\t0\r\n",
      "0015.2000-06-09.lokay\t0\t21\tenlargementWithATypo\t0\r\n",
      "0015.2001-02-12.kitchen\t0\t768\tassistance\t0\r\n",
      "0015.2001-02-12.kitchen\t0\t768\tvalium\t0\r\n",
      "0015.2001-02-12.kitchen\t0\t768\tenlargementWithATypo\t0\r\n",
      "0015.2001-07-05.SA_and_HP\t1\t133\tassistance\t0\r\n",
      "0015.2001-07-05.SA_and_HP\t1\t133\tvalium\t0\r\n",
      "0015.2001-07-05.SA_and_HP\t1\t133\tenlargementWithATypo\t0\r\n",
      "0015.2003-12-19.GP\t1\t183\tassistance\t0\r\n",
      "0015.2003-12-19.GP\t1\t183\tvalium\t0\r\n",
      "0015.2003-12-19.GP\t1\t183\tenlargementWithATypo\t0\r\n",
      "0016.1999-12-15.farmer\t0\t105\tassistance\t0\r\n",
      "0016.1999-12-15.farmer\t0\t105\tvalium\t0\r\n",
      "0016.1999-12-15.farmer\t0\t105\tenlargementWithATypo\t0\r\n",
      "0016.2001-02-12.kitchen\t0\t152\tassistance\t0\r\n",
      "0016.2001-02-12.kitchen\t0\t152\tvalium\t0\r\n",
      "0016.2001-02-12.kitchen\t0\t152\tenlargementWithATypo\t0\r\n",
      "0016.2001-07-05.SA_and_HP\t1\t133\tassistance\t0\r\n",
      "0016.2001-07-05.SA_and_HP\t1\t133\tvalium\t0\r\n",
      "0016.2001-07-05.SA_and_HP\t1\t133\tenlargementWithATypo\t0\r\n",
      "0016.2001-07-06.SA_and_HP\t1\t2673\tassistance\t0\r\n",
      "0016.2001-07-06.SA_and_HP\t1\t2673\tvalium\t0\r\n",
      "0016.2001-07-06.SA_and_HP\t1\t2673\tenlargementWithATypo\t0\r\n",
      "0016.2003-12-19.GP\t1\t99\tassistance\t0\r\n",
      "0016.2003-12-19.GP\t1\t99\tvalium\t1\r\n",
      "0016.2003-12-19.GP\t1\t99\tenlargementWithATypo\t0\r\n",
      "0016.2004-08-01.BG\t1\t99\tassistance\t0\r\n",
      "0016.2004-08-01.BG\t1\t99\tvalium\t0\r\n",
      "0016.2004-08-01.BG\t1\t99\tenlargementWithATypo\t0\r\n",
      "0017.1999-12-14.kaminski\t0\t59\tassistance\t0\r\n",
      "0017.1999-12-14.kaminski\t0\t59\tvalium\t0\r\n",
      "0017.1999-12-14.kaminski\t0\t59\tenlargementWithATypo\t0\r\n",
      "0017.2000-01-17.beck\t0\t423\tassistance\t0\r\n",
      "0017.2000-01-17.beck\t0\t423\tvalium\t0\r\n",
      "0017.2000-01-17.beck\t0\t423\tenlargementWithATypo\t0\r\n",
      "0017.2001-04-03.williams\t0\t65\tassistance\t0\r\n",
      "0017.2001-04-03.williams\t0\t65\tvalium\t0\r\n",
      "0017.2001-04-03.williams\t0\t65\tenlargementWithATypo\t0\r\n",
      "0017.2003-12-18.GP\t1\t33\tassistance\t0\r\n",
      "0017.2003-12-18.GP\t1\t33\tvalium\t0\r\n",
      "0017.2003-12-18.GP\t1\t33\tenlargementWithATypo\t0\r\n",
      "0017.2004-08-01.BG\t1\t99\tassistance\t0\r\n",
      "0017.2004-08-01.BG\t1\t99\tvalium\t1\r\n",
      "0017.2004-08-01.BG\t1\t99\tenlargementWithATypo\t0\r\n",
      "0017.2004-08-02.BG\t1\t357\tassistance\t0\r\n",
      "0017.2004-08-02.BG\t1\t357\tvalium\t0\r\n",
      "0017.2004-08-02.BG\t1\t357\tenlargementWithATypo\t0\r\n",
      "0018.1999-12-14.kaminski\t0\t156\tassistance\t0\r\n",
      "0018.1999-12-14.kaminski\t0\t156\tvalium\t0\r\n",
      "0018.1999-12-14.kaminski\t0\t156\tenlargementWithATypo\t0\r\n",
      "0018.2001-07-13.SA_and_HP\t1\t509\tassistance\t3\r\n",
      "0018.2001-07-13.SA_and_HP\t1\t509\tvalium\t0\r\n",
      "0018.2001-07-13.SA_and_HP\t1\t509\tenlargementWithATypo\t0\r\n",
      "0018.2003-12-18.GP\t1\t516\tassistance\t1\r\n",
      "0018.2003-12-18.GP\t1\t516\tvalium\t0\r\n",
      "0018.2003-12-18.GP\t1\t516\tenlargementWithATypo\t0\r\n"
     ]
    }
   ],
   "source": [
    "# showing the actual output of the mapper\n",
    "!./mapper.py enronemail_1h.txt assistance valium enlargementWithATypo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Reducer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting reducer.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile reducer.py\n",
    "#!/usr/bin/env python\n",
    "\n",
    "import sys\n",
    "\n",
    "output = {}\n",
    "probabilities = dict()\n",
    "\n",
    "# helper function to swap spam and ham\n",
    "def swap(x):\n",
    "    return 1 if x == 0 else 0\n",
    "\n",
    "for file_name in sys.argv[1:]:\n",
    "    for line in open(file_name, 'rb'):\n",
    "        line = line.strip()\n",
    "\n",
    "        try:\n",
    "            # parse the input we got from mapper.py\n",
    "            uid, spam, count_all, word, count_input = line.split('\\t')\n",
    "                        \n",
    "            spam = int(spam)\n",
    "            count_input = int(count_input)\n",
    "            count_all = int(count_all)\n",
    "            \n",
    "            # accumulate the count per word per class\n",
    "            key = (word, spam)\n",
    "            if key in probabilities:\n",
    "                probabilities[key] += count_input\n",
    "            else:\n",
    "                probabilities[key] = count_input\n",
    "                \n",
    "            # make sure both classes are represented\n",
    "            key_swapped = (word, swap(spam))\n",
    "            if key_swapped not in probabilities:\n",
    "                probabilities[key_swapped] = 0\n",
    "            \n",
    "            word_tuple = (word, count_input)\n",
    "            if uid in output:\n",
    "                output[uid]['words'].append(word_tuple)\n",
    "            else:\n",
    "                output[uid] = {'id': uid, 'spam': spam, 'count_all': count_all, 'words': [word_tuple]}\n",
    "            \n",
    "        except Exception, e:\n",
    "            print e\n",
    "            pass\n",
    "\n",
    "spams = sum([output[k]['spam'] for k in output])\n",
    "prior_spam = spams / float(len(output))\n",
    "prior_ham = 1 - prior_spam\n",
    "vocabulary_size = len(set([x[0] for x in probabilities]))\n",
    "\n",
    "def count_all(val):\n",
    "    return sum([output[x]['count_all'] for x in output if output[x]['spam'] == val])\n",
    "\n",
    "words_spam = count_all(1)\n",
    "words_ham = count_all(0)\n",
    "\n",
    "# update probabilities (with add-1 smoothing)\n",
    "for k in probabilities:\n",
    "    spam = key[1]\n",
    "    probabilities[k] = (probabilities[k] + 1) / float((words_spam if spam == 1 else words_ham) + vocabulary_size)\n",
    "\n",
    "# create outputs\n",
    "for out in output:\n",
    "    probability_spam = prior_spam\n",
    "    probability_ham = prior_ham\n",
    "    \n",
    "    for word_tuple in output[out]['words']:\n",
    "        word = word_tuple[0]\n",
    "        count = word_tuple[1]\n",
    "        probability_spam *= probabilities[(word, 1)] ** count\n",
    "        probability_ham *= probabilities[(word, 0)] ** count\n",
    "\n",
    "    prediction = 1 if probability_spam > probability_ham else 0\n",
    "    \n",
    "    print '%s\\t%s\\t%s' % (output[out]['id'], output[out]['spam'], prediction)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!./pNaiveBayes.sh 4 assistance valium enlargementWithATypo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.read_table('enronemail_1h.txt.output', header=None, names=['ID', 'TRUTH', 'CLASS'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>TRUTH</th>\n",
       "      <th>CLASS</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>0012.2003-12-19.GP</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>0016.2001-02-12.kitchen</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>0002.2004-08-01.BG</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>0002.2001-05-25.SA_and_HP</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>0011.2003-12-18.GP</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           ID  TRUTH  CLASS\n",
       "95         0012.2003-12-19.GP      1      0\n",
       "96    0016.2001-02-12.kitchen      0      0\n",
       "97         0002.2004-08-01.BG      1      1\n",
       "98  0002.2001-05-25.SA_and_HP      1      0\n",
       "99         0011.2003-12-18.GP      1      0"
      ]
     },
     "execution_count": 105,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prediction accuracy percentage: 60.00\n"
     ]
    }
   ],
   "source": [
    "print 'Prediction accuracy percentage: %.2f' % (100 * len(df[df.TRUTH == df.CLASS]) / float(len(df)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HW1.5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting mapper.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile mapper.py\n",
    "#!/usr/bin/env python\n",
    "\n",
    "import sys\n",
    "\n",
    "file_name = sys.argv[1]\n",
    "\n",
    "# input comes from STDIN (standard input)\n",
    "for email in open(file_name, 'rb'):\n",
    "    email = email.strip()\n",
    "    items = email.split('\\t') \n",
    "    \n",
    "    # outputs\n",
    "    uid = items[0]  # email unique id (first column)\n",
    "    spam = items[1]\n",
    "    count_all = len(email.split())\n",
    "\n",
    "    for word in email.split():\n",
    "        count_input = email.count(word)\n",
    "        print '%s\\t%s\\t%s\\t%s\\t%s' % (uid, spam, count_all, word, count_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Reducer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting reducer.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile reducer.py\n",
    "#!/usr/bin/env python\n",
    "\n",
    "import sys\n",
    "import math\n",
    "\n",
    "output = {}\n",
    "probabilities = dict()\n",
    "\n",
    "def swap(x):\n",
    "    return 1 if x == 0 else 0\n",
    "\n",
    "for file_name in sys.argv[1:]:\n",
    "    for line in open(file_name, 'rb'):\n",
    "        line = line.strip()\n",
    "\n",
    "        try:\n",
    "            # parse the input we got from mapper.py\n",
    "            uid, spam, count_all, word, count_input = line.split('\\t')\n",
    "                        \n",
    "            spam = int(spam)\n",
    "            count_input = int(count_input)\n",
    "            count_all = int(count_all)\n",
    "            \n",
    "            # accumulate the count per word per class\n",
    "            key = (word, spam)\n",
    "            if key in probabilities:\n",
    "                probabilities[key] += count_input\n",
    "            else:\n",
    "                probabilities[key] = count_input\n",
    "                \n",
    "            # make sure both classes are represented\n",
    "            key_swapped = (word, swap(spam))\n",
    "            if key_swapped not in probabilities:\n",
    "                probabilities[key_swapped] = 0\n",
    "            \n",
    "            word_tuple = (word, count_input)\n",
    "            if uid in output:\n",
    "                output[uid]['words'].append(word_tuple)\n",
    "            else:\n",
    "                output[uid] = {'id': uid, 'spam': spam, 'count_all': count_all, 'words': [word_tuple]}\n",
    "            \n",
    "        except Exception, e:\n",
    "            print e\n",
    "            pass\n",
    "\n",
    "spams = sum([output[k]['spam'] for k in output])\n",
    "prior_spam = spams / float(len(output))\n",
    "prior_ham = 1 - prior_spam\n",
    "vocabulary_size = len(set([x[0] for x in probabilities]))\n",
    "\n",
    "def count_all(val):\n",
    "    return sum([output[x]['count_all'] for x in output if output[x]['spam'] == val])\n",
    "\n",
    "words_spam = count_all(1)\n",
    "words_ham = count_all(0)\n",
    "\n",
    "# update probabilities (with add-1 smoothing)\n",
    "for k in probabilities:\n",
    "    spam = key[1]\n",
    "    probabilities[k] = (probabilities[k] + 1) / float((words_spam if spam == 1 else words_ham) + vocabulary_size)\n",
    "\n",
    "# create outputs\n",
    "for out in output:\n",
    "    probability_spam = math.log(prior_spam)\n",
    "    probability_ham = math.log(prior_ham)\n",
    "    \n",
    "    for word_tuple in output[out]['words']:\n",
    "        word = word_tuple[0]\n",
    "        count = word_tuple[1]\n",
    "        probability_spam += math.log(probabilities[(word, 1)]) * count\n",
    "        probability_ham += math.log(probabilities[(word, 0)]) * count\n",
    "\n",
    "    prediction = 1 if probability_spam > probability_ham else 0\n",
    "    \n",
    "    print '%s\\t%s\\t%s' % (output[out]['id'], output[out]['spam'], prediction)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!./pNaiveBayes.sh 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.read_table('enronemail_1h.txt.output',\n",
    "                   header=None, names=['ID', 'TRUTH', 'CLASS'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>TRUTH</th>\n",
       "      <th>CLASS</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0010.2003-12-18.GP</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0010.2001-06-28.SA_and_HP</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0001.2000-01-17.beck</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0018.1999-12-14.kaminski</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0005.1999-12-12.kaminski</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          ID  TRUTH  CLASS\n",
       "0         0010.2003-12-18.GP      1      1\n",
       "1  0010.2001-06-28.SA_and_HP      1      1\n",
       "2       0001.2000-01-17.beck      0      1\n",
       "3   0018.1999-12-14.kaminski      0      1\n",
       "4   0005.1999-12-12.kaminski      0      0"
      ]
     },
     "execution_count": 125,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 176,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prediction accuracy percentage: 68.00\n"
     ]
    }
   ],
   "source": [
    "accuracy = len(df[df.TRUTH == df.CLASS]) / float(len(df))\n",
    "print 'Prediction accuracy percentage: %.2f' % \\\n",
    "(100 * accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HW1.6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.naive_bayes import BernoulliNB\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "vec = CountVectorizer()\n",
    "X = vec.fit_transform([email for email in open('enronemail_1h.txt')])\n",
    "y = [email.split('\\t')[1] for email in open('enronemail_1h.txt')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### sklearn multinomial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 180,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Error (sklearn): 0.0\n",
      "Error (poor man): 0.32\n"
     ]
    }
   ],
   "source": [
    "clf = MultinomialNB()\n",
    "clf.fit(X, y)\n",
    "print 'Error (sklearn): %s' % (1 - clf.score(X, y))\n",
    "print 'Error (poor man): %s' % (1 - accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default sklearn's CountVectorizer gets rid of punctuation which helps with normalization of features. In other words \"this,\" and \"this\" are not considered two separate features. I did not use removal of stop words or enforcing thresholds (e.g. min/max_df) to further cleanup the list of features. Seems like even without those the removal of punctuation has helped a great deal. An improvement to my mapper can be to use to filter out punctuations which may help with error rate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### sklearn bernoulli"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Error: 0.15\n"
     ]
    }
   ],
   "source": [
    "clf = BernoulliNB()\n",
    "clf.fit(X, y)\n",
    "print 'Error: %s' % (1 - clf.score(X, y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Empirical research shows that neither model is consistently outperforming the other one. The results are predominantly a function of pre-processing and the subsequent feature selection. Theoretically binarizing the features would result in losing information (i.e. how many times the term showed up in relation to the length of the document) which may put Bernoulli at a disadvantage sometimes. However this very feature can also result in a model that is generalizing better (less variance) which may produce better outcomes if the corresponding multinomial model is modeling too much of the variance and overfitting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 185,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Error\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Poor man</th>\n",
       "      <th>sklearn (bernoulli)</th>\n",
       "      <th>sklearn (multinomial)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.32</td>\n",
       "      <td>0.15</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Poor man  sklearn (bernoulli)  sklearn (multinomial)\n",
       "0      0.32                 0.15                      0"
      ]
     },
     "execution_count": 185,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print 'Error'\n",
    "pd.DataFrame({'Poor man': [0.32],\n",
    "              'sklearn (multinomial)': [0.0],\n",
    "              'sklearn (bernoulli)': [0.15]}, )"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}