{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##K-Nearest Neighbors - Python Demonstration\n",
"**This script will run a k-Nearest Neighbors algorithm on our demo dataset. The goal is to classify a `Favorable` or `Unfavorable` response based on a customer's `Age` and `Sales` variables.**\n",
"\n",
"###Steps:\n",
"1. Import necessary packages\n",
" - We need the entire `pandas` package, but only two specific *names* from `sklearn`. \n",
" - Imports can be confusing to python beginners, for some additional details [see here](http://stackoverflow.com/a/21547572)\n",
"2. Read in the data sets\n",
"3. Define our X and Y objects from the training and testing dataset\n",
" - X typically refers to the predictor variables (also referred to as 'independent variables' or 'features')\n",
" - Y typically refers to our target variable (also referred to as 'dependent variable' or 'response variable')\n",
"4. Use the `StandardScaler` from `sklearn` to standardize the *X* variables\n",
"5. Define our model and fit the model to our data\n",
"6. Predict the Response variable in our test data set\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###1. Import necessary packages"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd # for basic data frame / analytics functionality\n",
"\n",
"from sklearn.preprocessing import StandardScaler # This will allow us to standardize data (subtract mean, divide by variance)\n",
"from sklearn.neighbors import KNeighborsClassifier # This will allow us to build a classifier using a kNN model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###2. Read in datasets from Excel files"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train = pd.read_excel('knn Training.xlsx') # using the read_excel function from the pandas ('pd') package\n",
"test = pd.read_excel('knn Testing.xlsx')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Customer</th>\n",
" <th>Age</th>\n",
" <th>Sales</th>\n",
" <th>Response</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>34</td>\n",
" <td>250</td>\n",
" <td>Favorable</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4</td>\n",
" <td>31</td>\n",
" <td>350</td>\n",
" <td>Favorable</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5</td>\n",
" <td>25</td>\n",
" <td>133</td>\n",
" <td>Favorable</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>7</td>\n",
" <td>35</td>\n",
" <td>364</td>\n",
" <td>Favorable</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>10</td>\n",
" <td>40</td>\n",
" <td>467</td>\n",
" <td>Favorable</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Customer Age Sales Response\n",
"0 1 34 250 Favorable\n",
"1 4 31 350 Favorable\n",
"2 5 25 133 Favorable\n",
"3 7 35 364 Favorable\n",
"4 10 40 467 Favorable"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head() # the head() function previews the first 5 rows of the dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###3. Define our X and Y Variables"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X_train = train[['Age','Sales']] # X is a list of two variables, hence the double brackets.\n",
"Y_train = train['Response'] # Y is our target, or what we want to predict"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X_test = test[['Age','Sales']] # Repeat for test data set\n",
"Y_test = test['Response']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###4. Standardize our X Variables\n",
"**Note:** Declaring objects before using them is an odd concept for python beginners. What we're going to first do is define `scaler` as a `StandardScaler` class. A class is an object that has associated functions or objects within it. The `StandardScaler` class has a `method` (or `function`) associated with it that fits and transforms the data that's input to it. This is programmatically different than doing something similar in R or SAS where you simply apply a function or procedure. We will see this again when we fit the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- *Below we declare the `scaler` object. If we wanted to change some options on the object (like only subtract the mean) we could do it below*"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"scaler = StandardScaler()"
]
},
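{
"cell_type": "markdown",
"metadata": {},
"source": [
"- *A minimal sketch of changing an option: passing `with_std=False` creates a scaler that only subtracts the mean. The name `center_only_scaler` is purely illustrative; we don't use this object in the rest of the tutorial.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"center_only_scaler = StandardScaler(with_std=False) # centers the data but skips dividing by the standard deviation"
]
},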
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Now we use the `fit` method from `StandardScaler` on our training data. A method is simply a function associated with a class. We define our new, scaled data as `X_train_std`. This will define the mean and std for the scaler so we can use it again on the test data."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\pbaumgartner\\AppData\\Local\\Continuum\\Anaconda\\lib\\site-packages\\sklearn\\utils\\validation.py:498: UserWarning: StandardScaler assumes floating point values as input, got int64\n",
" \"got %s\" % (estimator, X.dtype))\n"
]
}
],
"source": [
"X_train_std = scaler.fit_transform(X_train)"
]
},
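{
"cell_type": "markdown",
"metadata": {},
"source": [
"- *As a quick sanity check (a sketch; output omitted), each standardized column should now have a mean of roughly 0 and a standard deviation of roughly 1. `X_train_std` is a plain numpy array, so we can call its `mean` and `std` methods directly.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print X_train_std.mean(axis=0) # should be approximately [0, 0]\n",
"print X_train_std.std(axis=0) # should be approximately [1, 1]"
]
},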
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###5. Define our model and fit the model to our data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notes:** Similar to when we used `StandardScaler`, we will first define our model object. This time we will change some of the options when we declare the object. We want `k = 4`, the metric to be `'euclidean'` and our model to be `'distance'` weighted."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model = KNeighborsClassifier(n_neighbors=4, metric='euclidean', weights='distance') "
]
},
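{
"cell_type": "markdown",
"metadata": {},
"source": [
"- *To build intuition for `weights='distance'`: each of the k neighbors votes with weight 1/distance, so closer neighbors count for more. Below is a standalone sketch with made-up numbers (not from our data) showing how one close neighbor can outvote several farther ones.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np # used only for this illustration\n",
"\n",
"# Hypothetical distances from a query point to its 4 nearest neighbors,\n",
"# and each neighbor's class (illustrative values only)\n",
"distances = np.array([0.5, 1.0, 2.0, 4.0])\n",
"classes = np.array(['Favorable', 'Unfavorable', 'Unfavorable', 'Unfavorable'])\n",
"\n",
"weights = 1.0 / distances # closer neighbors get larger weights\n",
"for label in ['Favorable', 'Unfavorable']:\n",
"    print label, weights[classes == label].sum()\n",
"# The one close Favorable neighbor (weight 2.0) outvotes the three farther\n",
"# Unfavorable neighbors (total weight 1.0 + 0.5 + 0.25 = 1.75)"
]
},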
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Now we fit the model object to our data. The fit method takes 2 arguments, our predictor variables and our target variable. Once we run this, it will spit out our model's parameters."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',\n",
" metric_params=None, n_neighbors=4, p=2, weights='distance')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X_train_std, Y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Our model is now fit. To verify, we'll have it predict a point at [0, 0], which would be the standardized mean of both of our Age and Sales Varaibles."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[u'Unfavorable']\n"
]
}
],
"source": [
"print model.predict([0, 0])"
]
},
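{
"cell_type": "markdown",
"metadata": {},
"source": [
"- *We can also ask the fitted model which training points drove that prediction. Below is a sketch using the `kneighbors` method, which returns the distances to, and row indices of, the k nearest training observations (as with `predict`, newer sklearn versions require 2-D input like `[[0, 0]]`).*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"distances, indices = model.kneighbors([0, 0]) # newer sklearn: model.kneighbors([[0, 0]])\n",
"print distances # distances to the 4 nearest training points\n",
"print train.iloc[indices[0]] # the corresponding rows of the training data"
]
},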
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###6. Predict the Response variable in our test data set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to still standardize the predictor varaibles from our test data set. For this instance, we just want to use the `transform` method of `scaler`. Running `fit_transform` would refit `scaler` to the mean and std of our test data."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X_test_norm = scaler.transform(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we get our predictions. Again, we can just call the `predict` method from our model on our test data. Let's store our data in a variable called predictions, then we'll print out the predictions."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[u'Unfavorable' u'Favorable' u'Favorable' u'Favorable' u'Unfavorable'\n",
" u'Favorable']\n"
]
}
],
"source": [
"predictions = model.predict(X_test_norm)\n",
"print predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have our data, but it's in an awkward `array` format. Lets convert this data to a Series so that we can rejoin it with our test data set. A series is similar to a `vector` in R, it's just a structured 1-Dimensional array of data."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"predictionSeries = pd.Series(predictions, name='Predictions')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're also going to get our predicted probabilities. We can *chain* our two operations above (get the predictions, store them as a new type) all in one line. One exception is that here we are converting them to a `DataFrame` since the output is 2-Dimensional: we have a Probability of Favorable and a Probability of Unfavorable for each observation."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"predictions_proba = pd.DataFrame(model.predict_proba(X_test_norm), columns=['P(Favorable)', 'P(Unfavorable)'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally we're going to join both our predicted responses and their probabilities back into our test data set for analysis. We'll store this new object as `testPredictions` and view it."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"testPredictions = test.join([predictionSeries, predictions_proba])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Customer</th>\n",
" <th>Age</th>\n",
" <th>Sales</th>\n",
" <th>Response</th>\n",
" <th>Predictions</th>\n",
" <th>P(Favorable)</th>\n",
" <th>P(Unfavorable)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>15</td>\n",
" <td>19</td>\n",
" <td>101</td>\n",
" <td>NaN</td>\n",
" <td>Unfavorable</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>16</td>\n",
" <td>29</td>\n",
" <td>298</td>\n",
" <td>NaN</td>\n",
" <td>Favorable</td>\n",
" <td>0.764241</td>\n",
" <td>0.235759</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>17</td>\n",
" <td>25</td>\n",
" <td>122</td>\n",
" <td>NaN</td>\n",
" <td>Favorable</td>\n",
" <td>0.563406</td>\n",
" <td>0.436594</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>18</td>\n",
" <td>36</td>\n",
" <td>232</td>\n",
" <td>NaN</td>\n",
" <td>Favorable</td>\n",
" <td>0.825147</td>\n",
" <td>0.174853</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>19</td>\n",
" <td>22</td>\n",
" <td>235</td>\n",
" <td>NaN</td>\n",
" <td>Unfavorable</td>\n",
" <td>0.208313</td>\n",
" <td>0.791687</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>20</td>\n",
" <td>27</td>\n",
" <td>341</td>\n",
" <td>NaN</td>\n",
" <td>Favorable</td>\n",
" <td>0.801983</td>\n",
" <td>0.198017</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Customer Age Sales Response Predictions P(Favorable) P(Unfavorable)\n",
"0 15 19 101 NaN Unfavorable 0.000000 1.000000\n",
"1 16 29 298 NaN Favorable 0.764241 0.235759\n",
"2 17 25 122 NaN Favorable 0.563406 0.436594\n",
"3 18 36 232 NaN Favorable 0.825147 0.174853\n",
"4 19 22 235 NaN Unfavorable 0.208313 0.791687\n",
"5 20 27 341 NaN Favorable 0.801983 0.198017"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"testPredictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Recap\n",
"We've created, fit, and predicted new observations using a k-Nearest Neighbors model. Our final product was the predictions of 6 new observations and whether they were Favorable or Unfavorable.\n",
"\n",
"**Extra Credit** (some ideas for improvement):\n",
"- Validate that 4 is the optimal *k*\n",
"- Validate model using different random splits of the data. \n",
"- Look at sklearn's [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class, which allows us to create a single object that standardizes and transforms the data instead of having two discrete steps."
]
}
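,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a sketch of the first extra-credit idea: score several candidate values of *k* with cross-validation on the standardized training data and compare their mean accuracies. The import uses the `sklearn.cross_validation` module from this notebook's era; in newer sklearn versions it lives in `sklearn.model_selection`. With a training set this small the results are illustrative rather than conclusive."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.cross_validation import cross_val_score # sklearn.model_selection in newer versions\n",
"\n",
"for k in [1, 2, 3, 4, 5]:\n",
"    candidate = KNeighborsClassifier(n_neighbors=k, metric='euclidean', weights='distance')\n",
"    scores = cross_val_score(candidate, X_train_std, Y_train, cv=3) # 3-fold cross-validation accuracy\n",
"    print k, scores.mean()"
]
}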
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}