Naive Bayes Classifier using Jason Brownlee's Blog Post (100% copy of his work)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# How To Implement Naive Bayes From Scratch in Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What follows from here is entirely copied from the [Source](http://machinelearningmastery.com/naive-bayes-classifier-scratch-python/) as it is. All I did was to ensure that I understood each step before going to the next while copying each line of code and text as well. Thanks to [Jason Brownlee](http://machinelearningmastery.com/author/jasonb/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Naive Bayes Algorithm Tutorial\n",
"\n",
"This tutorial is broken down into the following steps:\n",
"\n",
"- Handle Data: Load the data from CSV file and split it into training and test datasets.\n",
"- Summarize Data: summarize the properties in the training dataset so that we can calculate probabilities and make predictions.\n",
"- Make a Prediction: Use the summaries of the dataset to generate a single prediction.\n",
"- Make Predictions: Generate predictions given a test dataset and a summarized training dataset.\n",
"- Evaluate Accuracy: Evaluate the accuracy of predictions made for a test dataset as the percentage correct out of all predictions made.\n",
"- Tie it Together: Use all of the code elements to present a complete and standalone implementation of the Naive Bayes algorithm."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Handle Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first thing we need to do is load our data file. The data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module.\n",
"\n",
"We also need to convert the attributes that were loaded as strings into numbers that we can work with them. Below is the `loadCsv()` function for loading the Pima indians dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import csv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def loadCsv(filename):\n",
" lines = csv.reader(open(filename, 'rb'))\n",
" dataset = list(lines)\n",
" for i in range(len(dataset)):\n",
" dataset[i] = [float(x) for x in dataset[i]]\n",
" return dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test this function by loading the pima indians dataset and printing the number of data instances that were loaded."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded data file pima-indians-diabetes.data.csv with 768 rows\n"
]
}
],
"source": [
"filename = 'pima-indians-diabetes.data.csv'\n",
"dataset = loadCsv(filename)\n",
"print ('Loaded data file {0} with {1} rows').format(filename, len(dataset))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"function"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(loadCsv)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"list"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(dataset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we need to split the data into a training dataset that Naive Bayes can use to make predictions and a test dataset that we can use to evaluate the accuracy of the model. We need to split the data set randomly into train and datasets with a ratio of 67% train and 33% test (this is a common ratio for testing an algorithm on a dataset).\n",
"\n",
"Below is the `splitDataset()` function that will split a given dataset into a given split ratio."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import random"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import random\n",
"def splitDataset(dataset, splitRatio):\n",
" trainSize = int(len(dataset) * splitRatio)\n",
" trainSet = []\n",
" copy = list(dataset)\n",
" while len(trainSet) < trainSize:\n",
" index = random.randrange(len(copy))\n",
" trainSet.append(copy.pop(index))\n",
" return [trainSet, copy]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test this out by defining a mock dataset with 5 instances, split it into training and testing datasets and print them out to see which data instances ended up where."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Split 1000 rows into train with 670 and test with 330\n"
]
}
],
"source": [
"dataset = random.sample(xrange(100000), 1000)\n",
"splitRatio = 0.67\n",
"train, test = splitDataset(dataset, splitRatio)\n",
"print('Split {0} rows into train with {1} and test with {2}').format(len(dataset), len(train), len(test))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1000"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(dataset)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"670.0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1000*0.67"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"670"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainSize = int(len(dataset) * splitRatio)\n",
"trainSize"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"copy = 1000\n",
"trainSet = 670\n"
]
}
],
"source": [
"copy = list(dataset)\n",
"print 'copy = ', len(copy)\n",
"trainSet = []\n",
"while len(trainSet) < trainSize:\n",
" index = random.randrange(len(copy))\n",
" trainSet.append(copy.pop(index))\n",
"print 'trainSet = ', len(trainSet)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"train = 670\n",
"test = 330\n"
]
}
],
"source": [
"train, test = splitDataset(dataset, splitRatio)\n",
"print 'train = ', len(train)\n",
"print 'test = ', len(test)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"670"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainSize"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[79974, 97087, 50681, 9806, 69316, 34754, 33027, 54282, 8922, 7733, 22827, 47330, 81150, 49203, 28029, 43636, 96181, 67862, 46188, 95612]\n"
]
}
],
"source": [
"print dataset[:20]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[79974, 97087, 69316, 34754, 33027, 8922, 22827, 81150, 67862, 46188, 92838, 5576, 37012, 66983, 2461, 63533, 97315, 40729, 29360, 96030]\n"
]
}
],
"source": [
"print copy[:20]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1000"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(dataset)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"330"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(copy)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"273"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"8"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random.randrange(10)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Split 768 rows into train with 514 and test with 254\n"
]
}
],
"source": [
"dataset = loadCsv(filename)\n",
"splitRatio = 0.67\n",
"train, test = splitDataset(dataset, splitRatio)\n",
"print('Split {0} rows into train with {1} and test with {2}').format(len(dataset), len(train), len(test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Summarize Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The naive bayes model is comprised of a summary of the data in the training dataset. This summary is then used when making predictions.\n",
"\n",
"The summary of the training data collected involves the mean and the standard deviation for each attribute, by class value. For example, if there are two class values and 7 numerical attributes, then we need a mean and standard deviation for each attribute (7) and class value (2) combination, that is 14 attribute summaries.\n",
"\n",
"These are required when making predictions to calculate the probability of specific attribute values belonging to each class value.\n",
"\n",
"We can break the preparation of this summary data down into the following sub-tasks:\n",
"\n",
"- Separate Data By Class\n",
"- Calculate Mean\n",
"- Calculate Standard Deviation\n",
"- Summarize Dataset\n",
"- Summarize Attributes By Class"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Separate Data By Class\n",
"\n",
"The first task is to separate the training dataset instances by class value so that we can calculate statistics for each class. We can do that by creating a map of each class value to a list of instances that belong to that class and sort the entire dataset of instances into the appropriate lists.\n",
"\n",
"The `separateByClass()` function below does just this."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def separateByClass(dataset):\n",
" separated = {}\n",
" for i in range(len(dataset)):\n",
" vector = dataset[i]\n",
" if (vector[-1] not in separated):\n",
" separated[vector[-1]] = []\n",
" separated[vector[-1]].append(vector)\n",
" return separated"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that the function assumes that the last attribute (-1) is the class value. The function returns a map of class values to lists of data instances.\n",
"\n",
"We can test this function with some sample data, as follows:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Separated instances: {0: [[2, 21, 0]], 1: [[1, 20, 1], [3, 22, 1]]}\n"
]
}
],
"source": [
"dataset = [[1,20,1], [2,21,0], [3,22,1]]\n",
"separated = separateByClass(dataset)\n",
"print('Separated instances: {0}').format(separated)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"separated2 = {}\n",
"for i in range(len(dataset)):\n",
" vector = dataset[i]\n",
" if (vector[-1] not in separated2):\n",
" separated2[vector[-1]] = []\n",
" separated2[vector[-1]].append(vector)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{0: [[2, 21, 0]], 1: [[1, 20, 1], [3, 22, 1]]}"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"separated2"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, 20, 1]\n",
"[2, 21, 0]\n",
"[3, 22, 1]\n"
]
}
],
"source": [
"for i in range(len(dataset)):\n",
" vector = dataset[i]\n",
" print vector"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vector[-1]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[[1, 20, 1], [3, 22, 1]]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"separated2[vector[-1]]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(dataset)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Separated instances: 2\n"
]
}
],
"source": [
"dataset = loadCsv(filename)\n",
"separated = separateByClass(dataset)\n",
"print('Separated instances: {0}').format(len(separated))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Calculate Mean & Standard Deviation\n",
"\n",
"We need to calculate the mean of each attribute for a class value. The mean is the central middle or central tendency of the data, and we will use it as the middle of our gaussian distribution when calculating probabilities.\n",
"\n",
"We also need to calculate the standard deviation of each attribute for a class value. The standard deviation describes the variation of spread of the data, and we will use it to characterize the expected spread of each attribute in our Gaussian distribution when calculating probabilities.\n",
"\n",
"The standard deviation is calculated as the square root of the variance. The variance is calculated as the average of the squared differences for each attribute value from the mean. Note we are using the N-1 method, which subtracts 1 from the number of attribute values when calculating the variance."
]
},
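{
"cell_type": "markdown",
"metadata": {},
"source": [
"Written out, for the $n$ values $x_1, \\ldots, x_n$ of an attribute, these are the standard sample statistics (the standard deviation uses the N-1 denominator described above):\n",
"\n",
"$$\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n} x_i, \\qquad s = \\sqrt{\\frac{1}{n-1}\\sum_{i=1}^{n} (x_i - \\bar{x})^2}$$"
]
},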
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import math"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def mean(numbers):\n",
" return sum(numbers)/float(len(numbers))"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def stdev(numbers):\n",
" avg = mean(numbers)\n",
" variance = sum([pow(x-avg, 2) for x in numbers])/float(len(numbers)-1)\n",
" return math.sqrt(variance)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test this by taking the mean of the numbers from 1 to 5."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean = 3.0\n",
"StDev = 1.58113883008\n",
"Summary of [1, 2, 3, 4, 5]: mean=3.0, stdev=1.58113883008\n"
]
}
],
"source": [
"numbers = [1,2,3,4,5]\n",
"print 'Mean = ', mean(numbers)\n",
"print 'StDev = ', stdev(numbers)\n",
"print ('Summary of {0}: mean={1}, stdev={2}').format(numbers, mean(numbers), stdev(numbers))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Summarize Dataset\n",
"\n",
"Now we have the tools to summarize a dataset. For a given list of instances (for a class value) we can calculate the mean and the standard deviation for each attribute.\n",
"\n",
"The zip function groups the values for each attribute across our data instances into their own lists so that we can compute the mean and standard deviation values for the attribute."
]
},
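{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside of my own (not from the tutorial), this is what `zip(*dataset)` does: it transposes the rows into one tuple per attribute, with the class values ending up in the last tuple."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rows = [[1, 20, 0], [2, 21, 1], [3, 22, 0]]\n",
"print zip(*rows)  # [(1, 2, 3), (20, 21, 22), (0, 1, 2)]"
]
},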
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def summarize(dataset):\n",
" summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]\n",
" del summaries[-1]\n",
" return summaries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test this `summarize()` function with some test data that shows markedly different mean and standard deviation values for the first and second data attributes."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Attribute summaries: [(2.0, 1.0), (21.0, 1.0)]\n"
]
}
],
"source": [
"dataset = [[1,20,0], [2,21,1], [3,22,0]]\n",
"summary = summarize(dataset)\n",
"print('Attribute summaries: {0}').format(summary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Summarize Attributes By Class\n",
"\n",
"We can pull it all together by first separating our training dataset into instances grouped by class. Then calculate the summaries for each attribute."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def summarizeByClass(dataset):\n",
" separated = separateByClass(dataset)\n",
" summaries = {}\n",
" for classValue, instances in separated.iteritems():\n",
" summaries[classValue] = summarize(instances)\n",
" return summaries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test this `summarizeByClass()` function with a small test dataset."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Summary by class value: {0: [(3.0, 1.4142135623730951), (21.5, 0.7071067811865476)], 1: [(2.0, 1.4142135623730951), (21.0, 1.4142135623730951)]}\n"
]
}
],
"source": [
"dataset = [[1,20,1], [2,21,0], [3,22,1], [4,22,0]]\n",
"summary = summarizeByClass(dataset)\n",
"print('Summary by class value: {0}').format(summary)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Summary by class value: {0.0: [(3.298, 3.01718458262189), (109.98, 26.14119975535359), (68.184, 18.063075413305828), (19.664, 14.889947113744254), (68.792, 98.86528929231767), (30.30419999999996, 7.689855011650112), (0.42973400000000017, 0.29908530435741093), (31.19, 11.667654791631156)], 1.0: [(4.865671641791045, 3.741239044041554), (141.25746268656715, 31.939622058007195), (70.82462686567165, 21.49181165060413), (22.16417910447761, 17.67971140046571), (100.33582089552239, 138.6891247315351), (35.14253731343278, 7.262967242346376), (0.5505, 0.372354483554611), (37.06716417910448, 10.968253652367915)]}\n"
]
}
],
"source": [
"dataset = loadCsv(filename)\n",
"summary = summarizeByClass(dataset)\n",
"print('Summary by class value: {0}').format(summary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Make Prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now ready to make predictions using the summaries prepared from our training data. Making predictions involves calculating the probability that a given data instance belongs to each class, then selecting the class with the largest probability as the prediction.\n",
"\n",
"We can divide this part into the following tasks:\n",
"\n",
"- Calculate Gaussian Probability Density Function\n",
"- Calculate Class Probabilities\n",
"- Make a Prediction\n",
"- Estimate Accuracy\n",
"\n",
"### Calculate Gaussian Probability Density Function\n",
"\n",
"We can use a Gaussian function to estimate the probability of a given attribute value, given the known mean and standard deviation for the attribute estimated from the training data.\n",
"\n",
"Given that the attribute summaries where prepared for each attribute and class value, the result is the conditional probability of a given attribute value given a class value.\n",
"\n",
"See the references for the details of this equation for the Gaussian probability density function. In summary we are plugging our known details into the Gaussian (attribute value, mean and standard deviation) and reading off the likelihood that our attribute value belongs to the class.\n",
"\n",
"In the `calculateProbability()` function we calculate the exponent first, then calculate the main division. This lets us fit the equation nicely on two lines."
]
},
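{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the Gaussian probability density function that `calculateProbability()` implements is:\n",
"\n",
"$$f(x \\mid \\mu, \\sigma) = \\frac{1}{\\sqrt{2\\pi}\\,\\sigma} \\exp\\left(-\\frac{(x - \\mu)^2}{2\\sigma^2}\\right)$$\n",
"\n",
"where $\\mu$ is the mean and $\\sigma$ is the standard deviation estimated from the training data."
]
},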
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note: To be able to understand this part at least, I would have to read a lot of stuff similar to [this resource](https://www.mne.psu.edu/me345/Lectures/Gaussian_or_Normal_PDF.pdf).**"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import math\n",
"def calculateProbability(x, mean, stdev):\n",
"\texponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))\n",
"\treturn (1 / (math.sqrt(2*math.pi) * stdev)) * exponent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test this with some sample data, as follows."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Probability of belonging to this class: 0.0624896575937\n"
]
}
],
"source": [
"x = 71.5\n",
"mean = 73\n",
"stdev = 6.2\n",
"probability = calculateProbability(x, mean, stdev)\n",
"print('Probability of belonging to this class: {0}').format(probability)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculate Class Probabilities\n",
"\n",
"Now that we can calculate the probability of an attribute belonging to a class, we can combine the probabilities of all of the attribute values for a data instance and come up with a probability of the entire data instance belonging to the class.\n",
"\n",
"We combine probabilities together by multiplying them. In the `calculateClassProbabilities()` below, the probability of a given data instance is calculated by multiplying together the attribute probabilities for each class. the result is a map of class values to probabilities."
]
},
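{
"cell_type": "markdown",
"metadata": {},
"source": [
"In symbols, for a class $c$ with per-attribute summaries $(\\mu_{c,i}, \\sigma_{c,i})$ and an input vector $(x_1, \\ldots, x_n)$, the function computes:\n",
"\n",
"$$P(c \\mid x_1, \\ldots, x_n) \\propto \\prod_{i=1}^{n} f(x_i \\mid \\mu_{c,i}, \\sigma_{c,i})$$\n",
"\n",
"Note that this simple implementation leaves out the class prior $P(c)$, which amounts to assuming the classes are equally likely."
]
},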
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def calculateClassProbabilities(summaries, inputVector):\n",
"\tprobabilities = {}\n",
"\tfor classValue, classSummaries in summaries.iteritems():\n",
"\t\tprobabilities[classValue] = 1\n",
"\t\tfor i in range(len(classSummaries)):\n",
"\t\t\tmean, stdev = classSummaries[i]\n",
"\t\t\tx = inputVector[i]\n",
"\t\t\tprobabilities[classValue] *= calculateProbability(x, mean, stdev)\n",
"\treturn probabilities"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test the `calculateClassProbabilities()` function."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Probabilities for each class: {0: 0.7820853879509118, 1: 6.298736258150442e-05}\n"
]
}
],
"source": [
"summaries = {0:[(1, 0.5)], 1:[(20, 5.0)]}\n",
"inputVector = [1.1, '?']\n",
"probabilities = calculateClassProbabilities(summaries, inputVector)\n",
"print('Probabilities for each class: {0}').format(probabilities)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Make a Prediction\n",
"\n",
"Now that can calculate the probability of a data instance belonging to each class value, we can look for the largest probability and return the associated class.\n",
"\n",
"The `predict()` function belong does just that."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def predict(summaries, inputVector):\n",
"\tprobabilities = calculateClassProbabilities(summaries, inputVector)\n",
"\tbestLabel, bestProb = None, -1\n",
"\tfor classValue, probability in probabilities.iteritems():\n",
"\t\tif bestLabel is None or probability > bestProb:\n",
"\t\t\tbestProb = probability\n",
"\t\t\tbestLabel = classValue\n",
"\treturn bestLabel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test the `predict()` function as follows:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prediction: A\n"
]
}
],
"source": [
"summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]}\n",
"inputVector = [1.1, '?']\n",
"result = predict(summaries, inputVector)\n",
"print('Prediction: {0}').format(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Make Predictions\n",
"\n",
"Finally, we can estimate the accuracy of the model by making predictions for each data instance in our test dataset. The `getPredictions()` will do this and return a list of predictions for each test instance."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def getPredictions(summaries, testSet):\n",
"\tpredictions = []\n",
"\tfor i in range(len(testSet)):\n",
"\t\tresult = predict(summaries, testSet[i])\n",
"\t\tpredictions.append(result)\n",
"\treturn predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test the `getPredictions()` function."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Predictions: ['A', 'B']\n"
]
}
],
"source": [
"summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]}\n",
"testSet = [[1.1, '?'], [19.1, '?']]\n",
"predictions = getPredictions(summaries, testSet)\n",
"print('Predictions: {0}').format(predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Get Accuracy\n",
"\n",
"The predictions can be compared to the class values in the test dataset and a classification accuracy can be calculated as an accuracy ratio between 0& and 100%. The `getAccuracy()` will calculate this accuracy ratio."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def getAccuracy(testSet, predictions):\n",
"\tcorrect = 0\n",
"\tfor x in range(len(testSet)):\n",
"\t\tif testSet[x][-1] == predictions[x]:\n",
"\t\t\tcorrect += 1\n",
"\treturn (correct/float(len(testSet))) * 100.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test the `getAccuracy()` function using the sample code below."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 66.6666666667\n"
]
}
],
"source": [
"testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]\n",
"predictions = ['a', 'a', 'a']\n",
"accuracy = getAccuracy(testSet, predictions)\n",
"print('Accuracy: {0}').format(accuracy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Tie it Together\n",
"\n",
"Finally, we need to tie it all together.\n",
"\n",
"Below provides the full code listing for Naive Bayes implemented from scratch in Python."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Split 768 rows into train=514 and test=254 rows\n",
"Accuracy: 75.1968503937%\n"
]
}
],
"source": [
"# Example of Naive Bayes implemented from Scratch in Python\n",
"import csv\n",
"import random\n",
"import math\n",
"\n",
"def loadCsv(filename):\n",
"\tlines = csv.reader(open(filename, \"rb\"))\n",
"\tdataset = list(lines)\n",
"\tfor i in range(len(dataset)):\n",
"\t\tdataset[i] = [float(x) for x in dataset[i]]\n",
"\treturn dataset\n",
"\n",
"def splitDataset(dataset, splitRatio):\n",
"\ttrainSize = int(len(dataset) * splitRatio)\n",
"\ttrainSet = []\n",
"\tcopy = list(dataset)\n",
"\twhile len(trainSet) < trainSize:\n",
"\t\tindex = random.randrange(len(copy))\n",
"\t\ttrainSet.append(copy.pop(index))\n",
"\treturn [trainSet, copy]\n",
"\n",
"def separateByClass(dataset):\n",
"\tseparated = {}\n",
"\tfor i in range(len(dataset)):\n",
"\t\tvector = dataset[i]\n",
"\t\tif (vector[-1] not in separated):\n",
"\t\t\tseparated[vector[-1]] = []\n",
"\t\tseparated[vector[-1]].append(vector)\n",
"\treturn separated\n",
"\n",
"def mean(numbers):\n",
"\treturn sum(numbers)/float(len(numbers))\n",
"\n",
"def stdev(numbers):\n",
"\tavg = mean(numbers)\n",
"\tvariance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)\n",
"\treturn math.sqrt(variance)\n",
"\n",
"def summarize(dataset):\n",
"\tsummaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]\n",
"\tdel summaries[-1]\n",
"\treturn summaries\n",
"\n",
"def summarizeByClass(dataset):\n",
"\tseparated = separateByClass(dataset)\n",
"\tsummaries = {}\n",
"\tfor classValue, instances in separated.iteritems():\n",
"\t\tsummaries[classValue] = summarize(instances)\n",
"\treturn summaries\n",
"\n",
"def calculateProbability(x, mean, stdev):\n",
"\texponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))\n",
"\treturn (1 / (math.sqrt(2*math.pi) * stdev)) * exponent\n",
"\n",
"def calculateClassProbabilities(summaries, inputVector):\n",
"\tprobabilities = {}\n",
"\tfor classValue, classSummaries in summaries.iteritems():\n",
"\t\tprobabilities[classValue] = 1\n",
"\t\tfor i in range(len(classSummaries)):\n",
"\t\t\tmean, stdev = classSummaries[i]\n",
"\t\t\tx = inputVector[i]\n",
"\t\t\tprobabilities[classValue] *= calculateProbability(x, mean, stdev)\n",
"\treturn probabilities\n",
"\t\t\t\n",
"def predict(summaries, inputVector):\n",
"\tprobabilities = calculateClassProbabilities(summaries, inputVector)\n",
"\tbestLabel, bestProb = None, -1\n",
"\tfor classValue, probability in probabilities.iteritems():\n",
"\t\tif bestLabel is None or probability > bestProb:\n",
"\t\t\tbestProb = probability\n",
"\t\t\tbestLabel = classValue\n",
"\treturn bestLabel\n",
"\n",
"def getPredictions(summaries, testSet):\n",
"\tpredictions = []\n",
"\tfor i in range(len(testSet)):\n",
"\t\tresult = predict(summaries, testSet[i])\n",
"\t\tpredictions.append(result)\n",
"\treturn predictions\n",
"\n",
"def getAccuracy(testSet, predictions):\n",
"\tcorrect = 0\n",
"\tfor i in range(len(testSet)):\n",
"\t\tif testSet[i][-1] == predictions[i]:\n",
"\t\t\tcorrect += 1\n",
"\treturn (correct/float(len(testSet))) * 100.0\n",
"\n",
"def main():\n",
"\tfilename = 'pima-indians-diabetes.data.csv'\n",
"\tsplitRatio = 0.67\n",
"\tdataset = loadCsv(filename)\n",
"\ttrainingSet, testSet = splitDataset(dataset, splitRatio)\n",
"\tprint('Split {0} rows into train={1} and test={2} rows').format(len(dataset), len(trainingSet), len(testSet))\n",
"\t# prepare model\n",
"\tsummaries = summarizeByClass(trainingSet)\n",
"\t# test model\n",
"\tpredictions = getPredictions(summaries, testSet)\n",
"\taccuracy = getAccuracy(testSet, predictions)\n",
"\tprint('Accuracy: {0}%').format(accuracy)\n",
"\n",
"main()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implementation Extensions\n",
"\n",
"This section provides you with ideas for extensions that you could apply and investigate with the Python code you have implemented as part of this tutorial.\n",
"\n",
"You have implemented your own version of Gaussian Naive Bayes in python from scratch.\n",
"\n",
"You can extend the implementation further.\n",
"\n",
"- **Calculate Class Probabilities**: Update the example to summarize the probabilities of a data instance belonging to each class as a ratio. This can be calculated as the probability of a data instance belonging to one class, divided by the sum of the probabilities of the data instance belonging to each class. For example an instance had the probability of 0.02 for class A and 0.001 for class B, the likelihood of the instance belonging to class A is (0.02/(0.02+0.001))*100 which is about 95.23%.\n",
"- **Log Probabilities**: The conditional probabilities for each class given an attribute value are small. When they are multiplied together they result in very small values, which can lead to floating point underflow (numbers too small to represent in Python). A common fix for this is to combine the log of the probabilities together. Research and implement this improvement.\n",
"- **Nominal Attributes:** Update the implementation to support nominal attributes. This is much similar and the summary information you can collect for each attribute is the ratio of category values for each class. Dive into the references for more information.\n",
"- **Different Density Function** (bernoulli or multinomial): We have looked at Gaussian Naive Bayes, but you can also look at other distributions. Implement a different distribution such as multinomial, bernoulli or kernel naive bayes that make different assumptions about the distribution of attribute values and/or their relationship with the class value."
]
},
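{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the log-probabilities extension, assuming the Python 2 helpers defined above (this is my own addition, not part of the original tutorial): compute the log of the Gaussian density directly and sum the logs per class. Because the log is monotonic, the class with the highest summed log score is the same one that would win on raw products, but the sums never underflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import math\n",
"\n",
"def logGaussian(x, mean, stdev):\n",
"\t# Log of the Gaussian density, computed directly so very small\n",
"\t# densities never underflow to 0.0 before the log is taken\n",
"\treturn -math.log(math.sqrt(2*math.pi)*stdev) - (math.pow(x-mean,2)/(2*math.pow(stdev,2)))\n",
"\n",
"def calculateClassLogProbabilities(summaries, inputVector):\n",
"\t# Mirrors calculateClassProbabilities(), but sums log densities\n",
"\t# instead of multiplying raw ones; the winning class is unchanged\n",
"\t# because log is monotonic\n",
"\tlogProbs = {}\n",
"\tfor classValue, classSummaries in summaries.iteritems():\n",
"\t\tlogProbs[classValue] = 0.0\n",
"\t\tfor i in range(len(classSummaries)):\n",
"\t\t\tmean, stdev = classSummaries[i]\n",
"\t\t\tlogProbs[classValue] += logGaussian(inputVector[i], mean, stdev)\n",
"\treturn logProbs"
]
},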
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Resources and Further Reading\n",
"\n",
"This section will provide some resources that you can use to learn more about the Naive Bayes algorithm in terms of both theory of how and why it works and practical concerns for implementing it in code.\n",
"\n",
"### Problem\n",
"\n",
"More resources for learning about the problem of predicting the onset of diabetes.\n",
"\n",
"- [Pima Indians Diabetes Data Set](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes): This page provides access to the dataset files, describes the attributes and lists papers that use the dataset.\n",
"- [Dataset File](https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data): The dataset file.\n",
"- [Dataset Summary](https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names): Description of the dataset attributes.\n",
"- [Diabetes Dataset Results](http://www.is.umk.pl/projects/datasets.html#Diabetes): The accuracy of many standard algorithms on this dataset.\n",
"\n",
"### Code\n",
"\n",
"This section links to open source implementations of Naive Bayes in popular machine learning libraries. Review these if you are considering implementing your own version of the method for operational use.\n",
"\n",
"- [Naive Bayes in Scikit-Learn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py): Implementation of naive bayes in the scikit-learn library.\n",
"- [Naive Bayes documentation](http://scikit-learn.org/stable/modules/naive_bayes.html): Scikit-Learn documentation and sample code for Naive Bayes\n",
"- [Simple Naive Bayes in Weka](https://github.com/baron/weka/blob/master/weka/src/main/java/weka/classifiers/bayes/NaiveBayesSimple.java): Weka implementation of naive bayes\n",
"\n",
"### Next Step\n",
"\n",
"Take action.\n",
"\n",
"Follow the tutorial and implement Naive Bayes from scratch. Adapt the example to another problem. Follow the extensions and improve upon the implementation.\n",
"\n",
"Leave a comment and share your experiences.\n",
"\n",
"**Update**: Check out the follow-up on tips for using the naive bayes algorithm titled: “Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm“"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# My Conclusion\n",
"\n",
"From all that I did here, I understood only the programming part. I mean the logic of each line of code. But since I do not know the statistical models, nor anything about machine learning, I did not understand actual stuff that this post is meant for. I am still happy I understood the coding logic. For instance, what is separating dataset by class value is something I am still not sure about why we should do that and what it is all about. But with the given dataset examples (not the csv file), I could see the logic used in coding. I hope I will understand what I wrote here if I read it myself after some time. I suddenly remember, \"You cannot explain something clearly unless you undertood it fully.\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.8"
},
"name": "2015-07-20-130838.ipynb"
},
"nbformat": 4,
"nbformat_minor": 0
}