@benvandyke
Created January 1, 2014 23:34
My kNN and PCA implementation for the Kaggle MNIST competition.
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Applying kNN and PCA to the [Kaggle Digit Recognizer](https://www.kaggle.com/c/digit-recognizer) Contest Using MNIST Data\n",
"\n",
"Ben Van Dyke, January 2014"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This IPython notebook shows my initial solution to the Kaggle Digit Recognizer contest. I use scikit-learn to reduce dimensionality with PCA, normalize the training and test data, cross-validate on the training data, and finally classify the test data. My submission scored 0.96786, better than the benchmark kNN score, which is a strong result given the simplicity and readability of this implementation."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# __future__ imports must come before any other statement\n",
"from __future__ import print_function\n",
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"# note: scikit-learn 0.18 and later provide this as sklearn.model_selection\n",
"from sklearn import cross_validation\n",
"import matplotlib.pyplot as plt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 57
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Preprocessing"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# load the train and test data\n",
"train = np.loadtxt('train.csv',delimiter=',',skiprows=1)\n",
"test = np.loadtxt('test.csv',delimiter=',',skiprows=1)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# separate labels from training data\n",
"train_data = train[:,1:]\n",
"train_labels = train[:,0]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# select number of components to extract\n",
"pca = PCA(n_components=40)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# fit to the training data\n",
"pca.fit(train_data)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 24,
"text": [
"PCA(copy=True, n_components=40, whiten=False)"
]
}
],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# determine amount of variance explained by components\n",
"np.sum(pca.explained_variance_ratio_)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 26,
"text": [
"0.78715300463419802"
]
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# plot the explained variance\n",
"plt.plot(pca.explained_variance_ratio_)\n",
"plt.title('Variance Explained by Extracted Component')\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x7f66c40d64d0>"
]
}
],
"prompt_number": 72
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 40 extracted components explain about 79% of the total variance in the dataset."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# extract the features\n",
"train_ext = pca.fit_transform(train_data)\n",
"print(train_ext.shape)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(42000, 40)\n"
]
}
],
"prompt_number": 68
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Feature extraction reduces the training data from 784 pixel columns to 40 component columns."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# transform the test data using the existing parameters\n",
"test_ext = pca.transform(test)\n",
"print(test_ext.shape)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(28000, 40)\n"
]
}
],
"prompt_number": 69
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because a nearest-neighbors classifier is based on distance, features with larger ranges would dominate the result, so the data needs to be normalized."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"min_max_scaler = MinMaxScaler()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 70
},
{
"cell_type": "code",
"collapsed": false,
"input": [
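"# learn the scaling ranges from the training data only\n",
"train_norm = min_max_scaler.fit_transform(train_ext)\n",
"# reuse those ranges for the test data instead of refitting\n",
"test_norm = min_max_scaler.transform(test_ext)"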
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 35
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Training"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# fit the model to the training data using defaults\n",
"# n_neighbors = 5\n",
"knn = KNeighborsClassifier()\n",
"knn.fit(train_norm, train_labels)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 40,
"text": [
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" n_neighbors=5, p=2, weights='uniform')"
]
}
],
"prompt_number": 40
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# estimate accuracy with five-fold cross-validation\n",
"cross_validation.cross_val_score(knn, train_norm, train_labels, cv=5)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 71,
"text": [
"array([ 0.96892857, 0.96928571, 0.97178571, 0.96535714, 0.96964286])"
]
}
],
"prompt_number": 71
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Five-fold cross-validation estimates performance on unseen data drawn from the same population. Here the classifier performed well, with accuracy between about 96.5% and 97.2% on each fold."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Predicting"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# predict the test classes\n",
"pred = knn.predict(test_norm)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 41
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# write the image ids (1..N) and predicted labels to a CSV file\n",
"save = pred.round()\n",
"ind = np.arange(1,len(pred) + 1)\n",
"new_save = np.column_stack((ind, save))\n",
"np.savetxt('knnpca.csv',new_save,delimiter=',',fmt='%0.0f')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 53
}
],
"metadata": {}
}
]
}
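
A note on the normalization step: the result depends on which data the scaling ranges are learned from. The sketch below is a plain-Python illustration of min-max scaling, assuming the usual formula `(x - min) / (max - min)` per column, which is what I understand `MinMaxScaler` to apply with its default range. The names `fit_min_max`, `apply_min_max`, `toy_train` and `toy_test` are hypothetical and are not part of the notebook or of scikit-learn.

```python
# Plain-Python sketch of min-max scaling; an illustration only,
# not the scikit-learn implementation.

def fit_min_max(rows):
    """Return per-column (min, max) pairs learned from the given rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, params):
    """Scale rows using previously learned (min, max) pairs."""
    scaled = []
    for row in rows:
        scaled.append([
            # a constant column has no range, so map it to 0.0
            (value - lo) / (hi - lo) if hi != lo else 0.0
            for value, (lo, hi) in zip(row, params)
        ])
    return scaled


# Hypothetical toy data standing in for train_ext and test_ext.
toy_train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
toy_test = [[5.0, 40.0]]

# Learn the ranges from the training rows only, then reuse them.
params = fit_min_max(toy_train)
train_scaled = apply_min_max(toy_train, params)
test_scaled = apply_min_max(toy_test, params)

print(train_scaled)
print(test_scaled)
```

In this sketch the test row scales to `[0.5, 1.5]`. A value outside the 0 to 1 range is expected whenever a test value exceeds the training range, and it keeps test points comparable to training points. Learning the ranges from the single test row instead would collapse it to `[0.0, 0.0]`, which no longer says anything about its distance to the training data.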