{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scikit-Learn K-Nearest Neighbors Classifier on MNIST\n",
"\n",
"Lets jump right into it by importing the required libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((70000, 784), (70000,))"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"from sklearn import datasets, model_selection\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.metrics import classification_report\n",
"\n",
"mnist = datasets.fetch_mldata('MNIST original')\n",
"data, target = mnist.data, mnist.target\n",
"\n",
"# make sure everything was correctly imported\n",
"data.shape, target.shape"
]
},
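{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row of `data` is a 28x28 grayscale image flattened into a 784-dimensional vector. As a quick sanity check (not part of the original notebook, and assuming matplotlib is installed), we can reshape one row and display it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# reshape a flattened 784-vector back into its 28x28 image and display it\n",
"plt.imshow(data[0].reshape(28, 28), cmap='gray')\n",
"plt.title(str(target[0]))\n",
"plt.show()"
]
},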
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Building data sets\n",
"Lets start the construction of the K-NN model by making a couple different data sets we can play with. We will make a function that can take the size of the dataset we want and return it. These data sets will be used by the model to classify our testing data. \n",
"\n",
"Lets construct some of the model's stored data sets below."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# make an array of indices the size of MNIST to use for making the data sets.\n",
"# This array is in random order, so we can use it to scramble up the MNIST data\n",
"indx = np.random.choice(len(target), 70000, replace=False)\n",
"\n",
"# method for building datasets to test with\n",
"def mk_dataset(size):\n",
" \"\"\"makes a dataset of size \"size\", and returns that datasets images and targets\n",
" This is used to make the dataset that will be stored by a model and used in \n",
" experimenting with different stored dataset sizes\n",
" \"\"\"\n",
" train_img = [data[i] for i in indx[:size]]\n",
" train_img = np.array(train_img)\n",
" train_target = [target[i] for i in indx[:size]]\n",
" train_target = np.array(train_target)\n",
" \n",
" return train_img, train_target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great. Now we will use this function to build two data sets of different sizes we can use to see how the model performs when it has different amounts of data available to it in making classifications. **Hint: making a smaller data set is like taking some of the points away in the images in the tutorial, you will still be able to make classifications but the model will have fewer points to work with, making it harder to make correct classifications.**"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((50000, 784), (50000,))"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# lets make a dataset of size 50,000, meaning the model will have 50,000 data points to compare each \n",
"# new point it is to classify to\n",
"fifty_x, fifty_y = mk_dataset(50000)\n",
"fifty_x.shape, fifty_y.shape"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((20000, 784), (20000,))"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# lets make one more of size 20,000 and see how classification accuracy decreases when we use that one\n",
"twenty_x, twenty_y = mk_dataset(20000)\n",
"twenty_x.shape, twenty_y.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how these data sets the model has needs the labels with it. The model needs these labels to understand what each point represents, and can thus put the point we are trying to classify, $p$, into a specific class as opposed to saying \"this is what this point is most similar to\", which you can't do as much with.\n",
"\n",
"Now we will build a testing data set of size 10,000. This is the data set we will run through the model, and see how the model does at classifying each point in the testing data set."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((10000, 784), (10000,))"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# build model testing dataset\n",
"test_img = [data[i] for i in indx[60000:70000]]\n",
"test_img1 = np.array(test_img)\n",
"test_target = [target[i] for i in indx[60000:70000]]\n",
"test_target1 = np.array(test_target)\n",
"test_img1.shape, test_target1.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! Now that we have all of our data set up, we can start playin with the K-NN model!\n",
"\n",
"### Building the model\n",
"We will start by putting the Scikit-Learn K-NN model into a function so we can easily call it and adjust it."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def skl_knn(k, test_data, test_target, stored_data, stored_target):\n",
" \"\"\"k: number of neighbors to use in classication\n",
" test_data: the data/targets used to test the classifier\n",
" stored_data: the data/targets used to classify the test_data\n",
" \"\"\"\n",
" \n",
" classifier = KNeighborsClassifier(n_neighbors=k) \n",
" classifier.fit(stored_data, stored_target)\n",
"\n",
" y_pred = classifier.predict(test_data) \n",
"\n",
" print(classification_report(test_target, y_pred))"
]
},
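{
"cell_type": "markdown",
"metadata": {},
"source": [
"A side note on speed: K-NN does almost no work at fit time and defers all distance computations to prediction, so the timings below are dominated by `predict`. As a sketch (not part of the original experiment), scikit-learn's `n_jobs` parameter can spread those distance computations across CPU cores; the actual speedup depends on your machine."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def skl_knn_parallel(k, test_data, test_target, stored_data, stored_target):\n",
"    \"\"\"Same as skl_knn, but n_jobs=-1 runs the neighbor search on all CPU cores.\"\"\"\n",
"    classifier = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)\n",
"    classifier.fit(stored_data, stored_target)\n",
"    y_pred = classifier.predict(test_data)\n",
"    print(classification_report(test_target, y_pred))"
]
},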
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Testing\n",
"\n",
"Now lets see how this model performs on the two different test sets."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0.0 0.98 0.99 0.99 997\n",
" 1.0 0.96 1.00 0.98 1118\n",
" 2.0 0.98 0.96 0.97 1041\n",
" 3.0 0.96 0.98 0.97 1036\n",
" 4.0 0.98 0.97 0.97 966\n",
" 5.0 0.97 0.97 0.97 924\n",
" 6.0 0.98 0.99 0.99 918\n",
" 7.0 0.96 0.98 0.97 1053\n",
" 8.0 0.99 0.92 0.96 977\n",
" 9.0 0.96 0.96 0.96 970\n",
"\n",
"avg / total 0.97 0.97 0.97 10000\n",
"\n",
"CPU times: user 8min, sys: 244 ms, total: 8min 1s\n",
"Wall time: 8min 1s\n"
]
}
],
"source": [
"%%time\n",
"# stored data set size of 50,000\n",
"skl_knn(5, test_img1, test_target1, fifty_x, fifty_y)"
]
},
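{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference: in the report above, per-class precision is $TP/(TP+FP)$ and recall is $TP/(TP+FN)$, the f1-score is their harmonic mean, and support is the number of test points belonging to each class."
]
},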
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0.0 0.98 0.99 0.98 997\n",
" 1.0 0.95 0.99 0.97 1118\n",
" 2.0 0.98 0.94 0.96 1041\n",
" 3.0 0.94 0.97 0.95 1036\n",
" 4.0 0.97 0.95 0.96 966\n",
" 5.0 0.96 0.96 0.96 924\n",
" 6.0 0.98 0.99 0.99 918\n",
" 7.0 0.95 0.97 0.96 1053\n",
" 8.0 0.99 0.90 0.94 977\n",
" 9.0 0.93 0.95 0.94 970\n",
"\n",
"avg / total 0.96 0.96 0.96 10000\n",
"\n",
"CPU times: user 3min 24s, sys: 240 ms, total: 3min 24s\n",
"Wall time: 3min 24s\n"
]
}
],
"source": [
"%%time\n",
"# stored data set size of 20,000\n",
"skl_knn(5, test_img1, test_target1, twenty_x, twenty_y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sweet! Our model matched the humans! As you can see, when the model has more data to work with (50,000 instead of 20,000 points) it performs better. One of the more remarkable things about this model, is that it is so simple, and yet could capture the complex relationships between unique images at the level of a human.\n",
"\n",
"To see a more in depth analysis, visit [this GitHub repository](https://github.com/samgrassi01/Cosine-Similarity-Classifier)."
]
}
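,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a possible follow-up experiment (a sketch only, not run here), we could tune $k$ by cross-validating a few values on one of the stored data sets, using the `model_selection` module we already imported:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical k sweep on the 20,000-point set; the smaller set keeps runtime manageable\n",
"for k in [1, 3, 5, 7]:\n",
"    scores = model_selection.cross_val_score(\n",
"        KNeighborsClassifier(n_neighbors=k, n_jobs=-1), twenty_x, twenty_y, cv=3)\n",
"    print(k, scores.mean())"
]
}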
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}