{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "HW4",
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyPzx0T5za4pEXlomBjF61T9",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/tuffacton/1afcf99de85d3301f9ace3052e8bb025/hw4.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "y60S3TiBD_Hs",
"colab_type": "text"
},
"source": [
"# Homework 4: ML Method Comparison & Ensemble Models\n",
"Nicolas Acton"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oBfNacioo-Qq",
"colab_type": "text"
},
"source": [
"## Part 1 (Task)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qGVYFiw0ilgK",
"colab_type": "text"
},
"source": [
"You will compare the performance of three different modeling techniques, as well as an ensemble of the three.\n",
"\n",
"Here are the steps:\n",
"1. Load the dataset (each file has sheets for \"train\" and \"test\")\n",
"2. Separate the target from the feature values. In the wifi data it's called \"level\" and in the shuttle data it's called \"class\".\n",
"3. Normalize the data appropriately (in scikit, they refer to this as scaling)\n",
"4. Train a k-nearest neighbor classifier on the training data, and use it to predict the test data output. We experiment with variations of hyperparameters to find a good model. Save the best predicted result.\n",
"5. Train a Gaussian naïve clasifier Bayes classifier; use it to predict the test data output. Experiment with variations (other types of Bayes classifiers, for example) to find a good mdel. Save the best predicted result.\n",
"6. Train a decision tree classifier and use it to predict the test data output. Experiment with variations (depth, criterion, etc.) to find a good model. Save the best predicted result.\n",
"7. Print the misclassification rate of each classifier in this format:\n",
"```\n",
"kNN: Number of mislabeled points out of a total 500 points: 22\n",
"```\n",
"8. Print out the confusion matrix of each classifier in this format:\n",
"```\n",
"[107\t0\t2\t0]\n",
"[0\t112\t4\t3]\n",
"[0\t7\t120\t0]\n",
"[2\t0\t6\t121]\n",
"```\n",
"9. Form a new set of predicted outputs by \"voting\" the three results of the three classifiers. In other words, if the three classifiers return the following class IDs: 4,1,1, the result will be a 1.\n",
"10. Print the misclassification rate and the confusion matrix for the ensemble method formed by voting the results of the three independent classifiers.\n",
"11. Discuss your results. For each of the two datasets, which worked the best? Why? Especially comment on the similarity or differences in the error cases betweeen the four classifiers. Is there a better alternative to combine the three than voting? What would you do to improve the performance?"
]
},
{
"cell_type": "code",
"metadata": {
"id": "id4k8T6mD07I",
"colab_type": "code",
"colab": {}
},
"source": [
"import sklearn\n",
"import pandas as pd\n",
"import numpy as np\n",
"import math"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3GrMU6fZ4IBO",
"colab_type": "text"
},
"source": [
"### Data Preparation\n",
"Luckily, we're using the same two datasets, each of which has their own train and test subsets, for all of our experiments. These datasets have been added to a Github Gist for better reproducibility in the future. We can set all of these to dataframes that will be much easier to act as bases for future experiments."
]
},
{
"cell_type": "code",
"metadata": {
"id": "9Dv4gnSf5mqV",
"colab_type": "code",
"colab": {}
},
"source": [
"shuttle_train_df = pd.read_csv(\"https://gist.githubusercontent.com/tuffacton/0e570983b1c6a5edd06ee816da1f75d7/raw/78b3791470e7d36e2c76de5a427a32bd3b6ca862/shuttle_train.csv\")\n",
"shuttle_test_df = pd.read_csv(\"https://gist.githubusercontent.com/tuffacton/0e570983b1c6a5edd06ee816da1f75d7/raw/78b3791470e7d36e2c76de5a427a32bd3b6ca862/shuttle_test.csv\")\n",
"wifi_train_df = pd.read_csv(\"https://gist.githubusercontent.com/tuffacton/0e570983b1c6a5edd06ee816da1f75d7/raw/78b3791470e7d36e2c76de5a427a32bd3b6ca862/wifi_train.csv\")\n",
"wifi_test_df = pd.read_csv(\"https://gist.githubusercontent.com/tuffacton/0e570983b1c6a5edd06ee816da1f75d7/raw/78b3791470e7d36e2c76de5a427a32bd3b6ca862/wifi_test.csv\")"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "_9iPIHeuBNqX",
"colab_type": "text"
},
"source": [
"Just for reference, lets give a small example of what each data set looks like."
]
},
{
"cell_type": "code",
"metadata": {
"id": "5h0khXZLBNG6",
"colab_type": "code",
"outputId": "b5d87819-68d1-4fd4-f7af-41183bd6ad2c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
}
},
"source": [
"# First the shuttle set\n",
"shuttle_train_df.head()"
],
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>x1</th>\n",
" <th>x2</th>\n",
" <th>x3</th>\n",
" <th>x4</th>\n",
" <th>x5</th>\n",
" <th>x6</th>\n",
" <th>x7</th>\n",
" <th>x8</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>50</td>\n",
" <td>21</td>\n",
" <td>77</td>\n",
" <td>0</td>\n",
" <td>28</td>\n",
" <td>0</td>\n",
" <td>27</td>\n",
" <td>48</td>\n",
" <td>22</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>55</td>\n",
" <td>0</td>\n",
" <td>92</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>26</td>\n",
" <td>36</td>\n",
" <td>92</td>\n",
" <td>56</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>53</td>\n",
" <td>0</td>\n",
" <td>82</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-5</td>\n",
" <td>29</td>\n",
" <td>30</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>37</td>\n",
" <td>0</td>\n",
" <td>76</td>\n",
" <td>0</td>\n",
" <td>28</td>\n",
" <td>18</td>\n",
" <td>40</td>\n",
" <td>48</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>37</td>\n",
" <td>0</td>\n",
" <td>79</td>\n",
" <td>0</td>\n",
" <td>34</td>\n",
" <td>-26</td>\n",
" <td>43</td>\n",
" <td>46</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time x1 x2 x3 x4 x5 x6 x7 x8 class\n",
"0 50 21 77 0 28 0 27 48 22 2\n",
"1 55 0 92 0 0 26 36 92 56 4\n",
"2 53 0 82 0 52 -5 29 30 2 1\n",
"3 37 0 76 0 28 18 40 48 8 1\n",
"4 37 0 79 0 34 -26 43 46 2 1"
]
},
"metadata": {
"tags": []
},
"execution_count": 4
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "jtIXfxq_o5uU",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "cdde30c2-da9c-4756-c68e-eb9ae6de2802"
},
"source": [
"# Lets see all possible class values\n",
"shuttle_test_df['class'].unique()"
],
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([4, 1, 5, 3, 2, 7, 6])"
]
},
"metadata": {
"tags": []
},
"execution_count": 5
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "uGA4sDuDo0Jo",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"outputId": "0fd4da1b-2133-4c37-9413-4daf45f07019"
},
"source": [
"# Then the wifi set\n",
"wifi_train_df.head()"
],
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>r1</th>\n",
" <th>r2</th>\n",
" <th>r3</th>\n",
" <th>r4</th>\n",
" <th>r5</th>\n",
" <th>r6</th>\n",
" <th>r7</th>\n",
" <th>level</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-59</td>\n",
" <td>-56</td>\n",
" <td>-58</td>\n",
" <td>-66</td>\n",
" <td>-51</td>\n",
" <td>-92</td>\n",
" <td>-88</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-62</td>\n",
" <td>-56</td>\n",
" <td>-57</td>\n",
" <td>-64</td>\n",
" <td>-65</td>\n",
" <td>-85</td>\n",
" <td>-83</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-69</td>\n",
" <td>-58</td>\n",
" <td>-46</td>\n",
" <td>-66</td>\n",
" <td>-48</td>\n",
" <td>-95</td>\n",
" <td>-93</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-56</td>\n",
" <td>-58</td>\n",
" <td>-60</td>\n",
" <td>-59</td>\n",
" <td>-66</td>\n",
" <td>-80</td>\n",
" <td>-77</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-48</td>\n",
" <td>-59</td>\n",
" <td>-53</td>\n",
" <td>-45</td>\n",
" <td>-74</td>\n",
" <td>-81</td>\n",
" <td>-81</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" r1 r2 r3 r4 r5 r6 r7 level\n",
"0 -59 -56 -58 -66 -51 -92 -88 4\n",
"1 -62 -56 -57 -64 -65 -85 -83 1\n",
"2 -69 -58 -46 -66 -48 -95 -93 4\n",
"3 -56 -58 -60 -59 -66 -80 -77 1\n",
"4 -48 -59 -53 -45 -74 -81 -81 3"
]
},
"metadata": {
"tags": []
},
"execution_count": 6
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "-a1ITsMxppfI",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "1d3ebace-0f7f-4570-8509-548dd6804c33"
},
"source": [
"# Lets see all possible class values\n",
"wifi_train_df.level.unique()"
],
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([4, 1, 3, 2])"
]
},
"metadata": {
"tags": []
},
"execution_count": 7
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dnhirO35j8ya",
"colab_type": "text"
},
"source": [
"As we can see, we seem to have datasets with a range of continuous features. Per this task, we have been given no domain knowledge, nor is any knowledge assumed, so we approach this problem with no *a priori* indication that any given feature should be weighted more or less than any other. That being said, they must be normalized as the scale of any given feature could be a bias that complicates the resulting models.\n",
"\n",
"Luckily, `scikit` has built in functionality to assist with this. We will use a *Standard Scaler* which uses the strict definition of standardization in that purely centers the data around a mean and normalizes that scale.\n",
"\n",
"We will move the descriptive and target features to X and Y respectively for each data set and perform the intended normalization."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zHrCd5oAX0jR",
"colab_type": "text"
},
"source": [
"We write a function to normalize our X dataframes going forward. We can re-run this with a different scaler if we ever need to do this again."
]
},
{
"cell_type": "code",
"metadata": {
"id": "KRy0pozxRkKO",
"colab_type": "code",
"colab": {}
},
"source": [
"def normalize(x):\n",
" # Change the attributes of the scaler if you wish to experiment.\n",
" from sklearn.preprocessing import StandardScaler\n",
" scaler = StandardScaler()\n",
" x = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)\n",
" return x"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "FcrmQ5JeZCpZ",
"colab_type": "text"
},
"source": [
"Then we normalize our X data while leaving the target features alone."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ezB9HX4zU-ik",
"colab_type": "code",
"colab": {}
},
"source": [
"# First we make X and Y for both the shuttle train and test sets\n",
"shuttle_train_X = normalize(shuttle_train_df.drop([\"class\"], axis=1))\n",
"shuttle_train_Y = shuttle_train_df[\"class\"]\n",
"shuttle_test_X = normalize(shuttle_test_df.drop([\"class\"], axis=1))\n",
"shuttle_test_Y = shuttle_test_df[\"class\"]\n",
"\n",
"# Then we make X and Y for both the wifi train and test sets\n",
"wifi_train_X = normalize(wifi_train_df.drop([\"level\"], axis=1))\n",
"wifi_train_Y = wifi_train_df[\"level\"]\n",
"wifi_test_X = normalize(wifi_test_df.drop([\"level\"], axis=1))\n",
"wifi_test_Y = wifi_test_df[\"level\"]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "QhGamFq3ZUoO",
"colab_type": "text"
},
"source": [
"Lets run a quick test to see what we're working with now that normalization has occured."
]
},
{
"cell_type": "code",
"metadata": {
"id": "xdouwQh_Sf9Y",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 221
},
"outputId": "713684fe-1b56-47af-f02e-ff3ebf8e9755"
},
"source": [
"print('A sample of the new normalized wifi training set is:')\n",
"wifi_train_X.head(5)"
],
"execution_count": 10,
"outputs": [
{
"output_type": "stream",
"text": [
"A sample of the new normalized wifi training set is:\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>r1</th>\n",
" <th>r2</th>\n",
" <th>r3</th>\n",
" <th>r4</th>\n",
" <th>r5</th>\n",
" <th>r6</th>\n",
" <th>r7</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-0.575710</td>\n",
" <td>-0.120151</td>\n",
" <td>-0.556895</td>\n",
" <td>-1.072716</td>\n",
" <td>1.289270</td>\n",
" <td>-1.690842</td>\n",
" <td>-0.957545</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-0.841395</td>\n",
" <td>-0.120151</td>\n",
" <td>-0.369388</td>\n",
" <td>-0.897855</td>\n",
" <td>-0.256176</td>\n",
" <td>-0.603517</td>\n",
" <td>-0.181828</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-1.461327</td>\n",
" <td>-0.703406</td>\n",
" <td>1.693185</td>\n",
" <td>-1.072716</td>\n",
" <td>1.620437</td>\n",
" <td>-2.156838</td>\n",
" <td>-1.733262</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-0.310025</td>\n",
" <td>-0.703406</td>\n",
" <td>-0.931908</td>\n",
" <td>-0.460701</td>\n",
" <td>-0.366565</td>\n",
" <td>0.173144</td>\n",
" <td>0.749032</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.398469</td>\n",
" <td>-0.995033</td>\n",
" <td>0.380638</td>\n",
" <td>0.763328</td>\n",
" <td>-1.249677</td>\n",
" <td>0.017811</td>\n",
" <td>0.128459</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" r1 r2 r3 r4 r5 r6 r7\n",
"0 -0.575710 -0.120151 -0.556895 -1.072716 1.289270 -1.690842 -0.957545\n",
"1 -0.841395 -0.120151 -0.369388 -0.897855 -0.256176 -0.603517 -0.181828\n",
"2 -1.461327 -0.703406 1.693185 -1.072716 1.620437 -2.156838 -1.733262\n",
"3 -0.310025 -0.703406 -0.931908 -0.460701 -0.366565 0.173144 0.749032\n",
"4 0.398469 -0.995033 0.380638 0.763328 -1.249677 0.017811 0.128459"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Oh_EhzKzZkbM",
"colab_type": "text"
},
"source": [
"Looks pretty good! We can use this data as the base for our model-building and make further transformations if needed (just be sure to use the `.copy()` method if transformations are needed, as dataframes are not immutable by default)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FN6sloS44ltw",
"colab_type": "text"
},
"source": [
"### K-Nearest Neighbor Classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RPW_aa77HgG9",
"colab_type": "text"
},
"source": [
"We've been tasked with training a k-nearest neighbor classifier on the training data and assessing it against the test data output. We will attempt some `k` optimizations to see if we can find an ideal model. We'll also need our misclassification rates and confusion matrices printed out for what we believe our best model will be."
]
},
{
"cell_type": "code",
"metadata": {
"id": "vHkanM7sHuT_",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.neighbors import KNeighborsClassifier as knn\n",
"from sklearn.metrics import confusion_matrix, accuracy_score, classification_report"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "dADNtkq0UFOG",
"colab_type": "text"
},
"source": [
"Because we're working with a relatively small data-set, both vertically and horizontally, why not just brute-force it across all available hyperparameters? We will attempt to write out a complete algorithm that does just that and output a small report for each \"experiment\" it has run using the error_rate as a stored objective function."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "U8ODKt7_0RRV",
"colab_type": "text"
},
"source": [
"#### KNN for Shuttle Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "59UjDfwyMMLO",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "9c2baefb-5f08-4ecc-c072-16e7b8fbaf9a"
},
"source": [
"# Set up the data as copies from our references\n",
"shuttle_X = shuttle_train_X.copy()\n",
"shuttle_Y = shuttle_train_Y.copy()\n",
"shuttle_X_test = shuttle_test_X.copy()\n",
"shuttle_Y_test = shuttle_test_Y.copy()\n",
"\n",
"# We will create a dataframe to store the parameters and results of our experiments\n",
"results_shuttle = pd.DataFrame(data=None, columns=[\"k\", \"p\", \"weight\",\"error_rate\"])\n",
"\n",
"# Just for kicks, we'll time it\n",
"import time\n",
"start_time = time.time()\n",
"\n",
"# We will be iterating from\n",
"for w in [\"uniform\", \"distance\"]:\n",
" for p_measure in [1, 2]:\n",
" for i in range(1,26):\n",
"\n",
" # Define our knn classifier here\n",
" classifier = knn(n_neighbors=i, weights=w, p=p_measure)\n",
" classifier.fit(shuttle_X, shuttle_Y)\n",
"\n",
" # We make our predictions against the test data\n",
" predictions = classifier.predict(shuttle_X_test)\n",
" actual = shuttle_Y_test\n",
"\n",
" # Determine our error rates\n",
" error_rate = ((predictions == actual).value_counts()[False]/actual.count())\n",
" results_shuttle=results_shuttle.append({'k': i, 'p': p_measure, 'weight': w, 'error_rate': error_rate}, ignore_index=True)\n",
"end_time = time.time()\n",
"\n",
"# We print the overall time it takes for our toy optimization\n",
"print(\"Seconds for calculation:\", end_time-start_time)"
],
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": [
"Seconds for calculation: 455.3734619617462\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6V_cdWSEgPQ5",
"colab_type": "text"
},
"source": [
"Great, we brute-forced in about 7.5 minutes, so now we have every possible optimization for the three hyperparameters we've determined. The error_rates from these experiments have been stored in a new dataframe. Lets observe the top 10."
]
},
{
"cell_type": "code",
"metadata": {
"id": "vh7rsSm-gNMO",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"outputId": "df4a66f3-174f-4fd5-da9c-1d3f8b23a966"
},
"source": [
"top_3 = results_shuttle.sort_values(by='error_rate').head(3)\n",
"top_3"
],
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>k</th>\n",
" <th>p</th>\n",
" <th>weight</th>\n",
" <th>error_rate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>uniform</td>\n",
" <td>0.000690</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>distance</td>\n",
" <td>0.000759</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>uniform</td>\n",
" <td>0.000897</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" k p weight error_rate\n",
"2 3 1 uniform 0.000690\n",
"52 3 1 distance 0.000759\n",
"0 1 1 uniform 0.000897"
]
},
"metadata": {
"tags": []
},
"execution_count": 12
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "avSC5CjYiKaU",
"colab_type": "text"
},
"source": [
"So according to a brute-force approach exploring the three top hyperparameters for a KNN a `k=3, p =1, and a uniform weight` will result in a very low error rate when compared to the test data. Lets go ahead and extract the metrics we're interested in for these top 3 to explore them a little more closely."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TW1C1EsenHK3",
"colab_type": "text"
},
"source": [
"##### Error Rates"
]
},
{
"cell_type": "code",
"metadata": {
"id": "3sIiNcvnit9x",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "38670da5-09bc-4331-e630-a1afa5365870"
},
"source": [
"for index, row in top_3.iterrows():\n",
" # Redo our calculations but only for these 3\n",
"\n",
" # Define our knn classifier here\n",
" classifier = knn(n_neighbors=row['k'], weights=row['weight'], p=row['p'])\n",
" classifier.fit(shuttle_X, shuttle_Y)\n",
"\n",
" # We make our predictions against the test data\n",
" predictions = classifier.predict(shuttle_X_test)\n",
" actual = shuttle_Y_test\n",
"\n",
" # We need the number of incorrect values per our tasking\n",
" incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
" # Print our necessary metrics\n",
" print('\\n\\n\\nReport for k=', row['k'], \" weight=\", row['weight'], \" p=\", row['p'])\n",
" print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
" print('\\nConfusion Matrix')\n",
" print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
" print('\\nAccuracy Score: ')\n",
" print(accuracy_score(actual, predictions))\n",
" print('\\nReport: ')\n",
" print(classification_report(actual, predictions))"
],
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"\n",
"\n",
"Report for k= 3 weight= uniform p= 1\n",
"Number of mislabeled points out of a total 14500 points: 10\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4 5 6 7\n",
"Actual \n",
"1 11476 0 1 0 0 0 1\n",
"2 0 12 0 1 0 0 0\n",
"3 4 0 35 0 0 0 0\n",
"4 0 0 0 2155 0 0 0\n",
"5 2 0 0 0 807 0 0\n",
"6 0 0 0 0 1 3 0\n",
"7 0 0 0 0 0 0 2\n",
"\n",
"Accuracy Score: \n",
"0.9993103448275862\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 1.00 1.00 1.00 11478\n",
" 2 1.00 0.92 0.96 13\n",
" 3 0.97 0.90 0.93 39\n",
" 4 1.00 1.00 1.00 2155\n",
" 5 1.00 1.00 1.00 809\n",
" 6 1.00 0.75 0.86 4\n",
" 7 0.67 1.00 0.80 2\n",
"\n",
" accuracy 1.00 14500\n",
" macro avg 0.95 0.94 0.94 14500\n",
"weighted avg 1.00 1.00 1.00 14500\n",
"\n",
"\n",
"\n",
"\n",
"Report for k= 3 weight= distance p= 1\n",
"Number of mislabeled points out of a total 14500 points: 11\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4 5 6 7\n",
"Actual \n",
"1 11475 0 1 0 0 0 2\n",
"2 0 12 0 1 0 0 0\n",
"3 4 0 35 0 0 0 0\n",
"4 0 0 0 2155 0 0 0\n",
"5 2 0 0 0 807 0 0\n",
"6 0 0 0 0 1 3 0\n",
"7 0 0 0 0 0 0 2\n",
"\n",
"Accuracy Score: \n",
"0.9992413793103448\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 1.00 1.00 1.00 11478\n",
" 2 1.00 0.92 0.96 13\n",
" 3 0.97 0.90 0.93 39\n",
" 4 1.00 1.00 1.00 2155\n",
" 5 1.00 1.00 1.00 809\n",
" 6 1.00 0.75 0.86 4\n",
" 7 0.50 1.00 0.67 2\n",
"\n",
" accuracy 1.00 14500\n",
" macro avg 0.92 0.94 0.92 14500\n",
"weighted avg 1.00 1.00 1.00 14500\n",
"\n",
"\n",
"\n",
"\n",
"Report for k= 1 weight= uniform p= 1\n",
"Number of mislabeled points out of a total 14500 points: 13\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4 5 6 7\n",
"Actual \n",
"1 11473 0 3 0 0 0 2\n",
"2 0 12 0 1 0 0 0\n",
"3 4 0 35 0 0 0 0\n",
"4 0 0 0 2155 0 0 0\n",
"5 0 0 0 1 808 0 0\n",
"6 0 0 0 0 1 3 0\n",
"7 1 0 0 0 0 0 1\n",
"\n",
"Accuracy Score: \n",
"0.9991034482758621\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 1.00 1.00 1.00 11478\n",
" 2 1.00 0.92 0.96 13\n",
" 3 0.92 0.90 0.91 39\n",
" 4 1.00 1.00 1.00 2155\n",
" 5 1.00 1.00 1.00 809\n",
" 6 1.00 0.75 0.86 4\n",
" 7 0.33 0.50 0.40 2\n",
"\n",
" accuracy 1.00 14500\n",
" macro avg 0.89 0.87 0.87 14500\n",
"weighted avg 1.00 1.00 1.00 14500\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eO24lmeP0VSG",
"colab_type": "text"
},
"source": [
"#### KNN for Wifi Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Dw0oN08r0gk-",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "bb4782b9-5380-421c-d27c-8babd580f5b4"
},
"source": [
"# Set up the data as copies from our references\n",
"wifi_X = wifi_train_X.copy()\n",
"wifi_Y = wifi_train_Y.copy()\n",
"wifi_X_test = wifi_test_X.copy()\n",
"wifi_Y_test = wifi_test_Y.copy()\n",
"\n",
"# We will create a dataframe to store the parameters and results of our experiments\n",
"results_wifi = pd.DataFrame(data=None, columns=[\"k\", \"p\", \"weight\",\"error_rate\"])\n",
"\n",
"# Just for kicks, we'll time it\n",
"import time\n",
"start_time = time.time()\n",
"\n",
"# We will be iterating from\n",
"for w in [\"uniform\", \"distance\"]:\n",
" for p_measure in [1, 2]:\n",
" for i in range(1,26):\n",
"\n",
" # Define our knn classifier here\n",
" classifier = knn(n_neighbors=i, weights=w, p=p_measure)\n",
" classifier.fit(wifi_X, wifi_Y)\n",
"\n",
" # We make our predictions against the test data\n",
" predictions = classifier.predict(wifi_X_test)\n",
" actual = wifi_Y_test\n",
"\n",
" # Determine our error rates\n",
" error_rate = ((predictions == actual).value_counts()[False]/actual.count())\n",
" results_wifi=results_wifi.append({'k': i, 'p': p_measure, 'weight': w, 'error_rate': error_rate}, ignore_index=True)\n",
"end_time = time.time()\n",
"\n",
"# We print the overall time it takes for our toy optimization\n",
"print(\"Seconds for calculation:\", end_time-start_time)"
],
"execution_count": 14,
"outputs": [
{
"output_type": "stream",
"text": [
"Seconds for calculation: 2.226003408432007\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Nk_J9wk_2uem",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"outputId": "ff12a0c6-59da-413c-cce8-ac24367c1b50"
},
"source": [
"top_3 = results_wifi.sort_values(by='error_rate').head(3)\n",
"top_3"
],
"execution_count": 15,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>k</th>\n",
" <th>p</th>\n",
" <th>weight</th>\n",
" <th>error_rate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>distance</td>\n",
" <td>0.016</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>distance</td>\n",
" <td>0.018</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>uniform</td>\n",
" <td>0.018</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" k p weight error_rate\n",
"78 4 2 distance 0.016\n",
"53 4 1 distance 0.018\n",
"27 3 2 uniform 0.018"
]
},
"metadata": {
"tags": []
},
"execution_count": 15
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NRXLY1xinMVj",
"colab_type": "text"
},
"source": [
"##### Error Rates"
]
},
{
"cell_type": "code",
"metadata": {
"id": "BilXqNMY8Apo",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "675040ee-36bb-4435-f37b-c8336bc65057"
},
"source": [
"for index, row in top_3.iterrows():\n",
" # Redo our calculations but only for these 3\n",
"\n",
" # Define our knn classifier here\n",
" classifier = knn(n_neighbors=row['k'], weights=row['weight'], p=row['p'])\n",
" classifier.fit(wifi_X, wifi_Y)\n",
"\n",
" # We make our predictions against the test data\n",
" predictions = classifier.predict(wifi_X_test)\n",
" actual = wifi_Y_test\n",
"\n",
" # We need the number of incorrect values per our tasking\n",
" incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
" # Print our necessary metrics\n",
" print('\\n\\n\\nReport for k=', row['k'], \" weight=\", row['weight'], \" p=\", row['p'])\n",
" print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
" print('\\nConfusion Matrix')\n",
" print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
" print('\\nAccuracy Score: ')\n",
" print(accuracy_score(actual, predictions))\n",
" print('\\nReport: ')\n",
" print(classification_report(actual, predictions))"
],
"execution_count": 16,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"\n",
"\n",
"Report for k= 4 weight= distance p= 2\n",
"Number of mislabeled points out of a total 500 points: 8\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 109 0 0 1\n",
"2 0 135 3 0\n",
"3 1 0 120 0\n",
"4 1 0 2 128\n",
"\n",
"Accuracy Score: \n",
"0.984\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.98 0.99 0.99 110\n",
" 2 1.00 0.98 0.99 138\n",
" 3 0.96 0.99 0.98 121\n",
" 4 0.99 0.98 0.98 131\n",
"\n",
" accuracy 0.98 500\n",
" macro avg 0.98 0.98 0.98 500\n",
"weighted avg 0.98 0.98 0.98 500\n",
"\n",
"\n",
"\n",
"\n",
"Report for k= 4 weight= distance p= 1\n",
"Number of mislabeled points out of a total 500 points: 9\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 109 0 0 1\n",
"2 0 134 4 0\n",
"3 1 0 120 0\n",
"4 1 0 2 128\n",
"\n",
"Accuracy Score: \n",
"0.982\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.98 0.99 0.99 110\n",
" 2 1.00 0.97 0.99 138\n",
" 3 0.95 0.99 0.97 121\n",
" 4 0.99 0.98 0.98 131\n",
"\n",
" accuracy 0.98 500\n",
" macro avg 0.98 0.98 0.98 500\n",
"weighted avg 0.98 0.98 0.98 500\n",
"\n",
"\n",
"\n",
"\n",
"Report for k= 3 weight= uniform p= 2\n",
"Number of mislabeled points out of a total 500 points: 9\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 109 0 0 1\n",
"2 0 133 5 0\n",
"3 0 0 121 0\n",
"4 1 0 2 128\n",
"\n",
"Accuracy Score: \n",
"0.982\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.99 0.99 0.99 110\n",
" 2 1.00 0.96 0.98 138\n",
" 3 0.95 1.00 0.97 121\n",
" 4 0.99 0.98 0.98 131\n",
"\n",
" accuracy 0.98 500\n",
" macro avg 0.98 0.98 0.98 500\n",
"weighted avg 0.98 0.98 0.98 500\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DL72GgNBzaQy",
"colab_type": "text"
},
"source": [
"#### Results"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E2Hx6ZtCzkBo",
"colab_type": "text"
},
"source": [
"Wow, we seem to have models that generalize with incredible accuracy and precision if we choose a KNN with k=3, using a uniform weight, and p = 1 for the shuttle dataset and k= 4, weight= distance, and p= 2 for the wifi dataset. We'll keep this in mind going forward."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9Rg7X_-Vz1c4",
"colab_type": "text"
},
"source": [
"### Bayes Classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "V9UjKkEY8z8x",
"colab_type": "text"
},
"source": [
"We have been tasked with optimizing for a Gaussian Naïve Bayes classifier as well. Luckily, `sklearn` has a package ready to build and test robust classifiers."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Cz5NCdSu9CPl",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.naive_bayes import GaussianNB as gnb\n",
"from sklearn.metrics import confusion_matrix, accuracy_score, classification_report"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "s0P3itTq9Oly",
"colab_type": "text"
},
"source": [
"#### Gaussian Naïve Bayes for Shuttle Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "9luQ7B4A9Sji",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 683
},
"outputId": "cadee18d-5a3c-4dbf-dded-bc96b3e0114a"
},
"source": [
"# Set up the data as copies from our references\n",
"shuttle_X = shuttle_train_X.copy()\n",
"shuttle_Y = shuttle_train_Y.copy()\n",
"shuttle_X_test = shuttle_test_X.copy()\n",
"shuttle_Y_test = shuttle_test_Y.copy()\n",
"\n",
"# Just for kicks, we'll time it\n",
"import time\n",
"start_time = time.time()\n",
"\n",
"# Define our gnb classifier here\n",
"classifier = gnb()\n",
"classifier.fit(shuttle_X, shuttle_Y)\n",
"\n",
"# We make our predictions against the test data\n",
"predictions = classifier.predict(shuttle_X_test)\n",
"actual = shuttle_Y_test\n",
"\n",
"# Determine our error rates\n",
"error_rate = ((predictions == actual).value_counts()[False]/actual.count())\n",
"end_time = time.time()\n",
"\n",
"# We print the overall time it takes for our toy optimization\n",
"print(\"Seconds for calculation:\", end_time-start_time)\n",
"\n",
"# We need the number of incorrect values per our tasking\n",
"incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
"# Print our necessary metrics\n",
"print('\\n\\n\\nReport for Gaussian Naive Bayes')\n",
"print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
"print('\\nConfusion Matrix')\n",
"print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
"print('\\nAccuracy Score: ')\n",
"print(accuracy_score(actual, predictions))\n",
"print('\\nReport: ')\n",
"print(classification_report(actual, predictions))"
],
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"text": [
"Seconds for calculation: 0.02194690704345703\n",
"\n",
"\n",
"\n",
"Report for Gaussian Naive Bayes\n",
"Number of mislabeled points out of a total 14500 points: 1220\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 3 4 5 6\n",
"Actual \n",
"1 10848 480 139 4 7\n",
"2 8 3 0 1 1\n",
"3 7 31 0 1 0\n",
"4 565 1 1589 0 0\n",
"5 0 0 0 808 1\n",
"6 0 0 0 0 4\n",
"7 0 0 0 0 2\n",
"\n",
"Accuracy Score: \n",
"0.9158620689655173\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.95 0.95 0.95 11478\n",
" 2 0.00 0.00 0.00 13\n",
" 3 0.06 0.79 0.11 39\n",
" 4 0.92 0.74 0.82 2155\n",
" 5 0.99 1.00 1.00 809\n",
" 6 0.27 1.00 0.42 4\n",
" 7 0.00 0.00 0.00 2\n",
"\n",
" accuracy 0.92 14500\n",
" macro avg 0.46 0.64 0.47 14500\n",
"weighted avg 0.94 0.92 0.93 14500\n",
"\n"
],
"name": "stdout"
},
{
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
" _warn_prf(average, modifier, msg_start, len(result))\n"
],
"name": "stderr"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pnWMaK9UDWrm",
"colab_type": "text"
},
"source": [
"#### Gaussian Naive Bayes for Wifi Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "9GLzdDUSDc3x",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 527
},
"outputId": "16c8d7d9-5520-4a45-fbce-e163a86a4d4f"
},
"source": [
"# Set up the data as copies from our references\n",
"wifi_X = wifi_train_X.copy()\n",
"wifi_Y = wifi_train_Y.copy()\n",
"wifi_X_test = wifi_test_X.copy()\n",
"wifi_Y_test = wifi_test_Y.copy()\n",
"\n",
"# Just for kicks, we'll time it\n",
"import time\n",
"start_time = time.time()\n",
"\n",
"# Define our gnb classifier here\n",
"classifier = gnb()\n",
"classifier.fit(wifi_X, wifi_Y)\n",
"\n",
"# We make our predictions against the test data\n",
"predictions = classifier.predict(wifi_X_test)\n",
"actual = wifi_Y_test\n",
"\n",
"# Determine our error rates\n",
"error_rate = ((predictions == actual).value_counts()[False]/actual.count())\n",
"end_time = time.time()\n",
"\n",
"# We print the overall time it takes for our toy optimization\n",
"print(\"Seconds for calculation:\", end_time-start_time)\n",
"\n",
"# We need the number of incorrect values per our tasking\n",
"incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
"# Print our necessary metrics\n",
"print('\\n\\n\\nReport for Gaussian Naive Bayes')\n",
"print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
"print('\\nConfusion Matrix')\n",
"print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
"print('\\nAccuracy Score: ')\n",
"print(accuracy_score(actual, predictions))\n",
"print('\\nReport: ')\n",
"print(classification_report(actual, predictions))"
],
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"text": [
"Seconds for calculation: 0.004072904586791992\n",
"\n",
"\n",
"\n",
"Report for Gaussian Naive Bayes\n",
"Number of mislabeled points out of a total 500 points: 10\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 109 0 1 0\n",
"2 0 131 7 0\n",
"3 1 0 120 0\n",
"4 1 0 0 130\n",
"\n",
"Accuracy Score: \n",
"0.98\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.98 0.99 0.99 110\n",
" 2 1.00 0.95 0.97 138\n",
" 3 0.94 0.99 0.96 121\n",
" 4 1.00 0.99 1.00 131\n",
"\n",
" accuracy 0.98 500\n",
" macro avg 0.98 0.98 0.98 500\n",
"weighted avg 0.98 0.98 0.98 500\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PJOzmMqxEKZN",
"colab_type": "text"
},
"source": [
"#### Bernoulli Naive Bayes for Shuttle Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "CPN4EorJEYWt",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 683
},
"outputId": "a887ca8a-4eb8-4e02-f6ab-9cd7c5dc9532"
},
"source": [
"from sklearn.naive_bayes import BernoulliNB as bnb\n",
"\n",
"# Set up the data as copies from our references\n",
"shuttle_X = shuttle_train_X.copy()\n",
"shuttle_Y = shuttle_train_Y.copy()\n",
"shuttle_X_test = shuttle_test_X.copy()\n",
"shuttle_Y_test = shuttle_test_Y.copy()\n",
"\n",
"# Just for kicks, we'll time it\n",
"import time\n",
"start_time = time.time()\n",
"\n",
"# Define our gnb classifier here\n",
"classifier = bnb()\n",
"classifier.fit(shuttle_X, shuttle_Y)\n",
"\n",
"# We make our predictions against the test data\n",
"predictions = classifier.predict(shuttle_X_test)\n",
"actual = shuttle_Y_test\n",
"\n",
"# Determine our error rates\n",
"error_rate = ((predictions == actual).value_counts()[False]/actual.count())\n",
"end_time = time.time()\n",
"\n",
"# We print the overall time it takes for our toy optimization\n",
"print(\"Seconds for calculation:\", end_time-start_time)\n",
"\n",
"# We need the number of incorrect values per our tasking\n",
"incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
"# Print our necessary metrics\n",
"print('\\n\\n\\nReport for Multinomail Naive Bayes')\n",
"print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
"print('\\nConfusion Matrix')\n",
"print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
"print('\\nAccuracy Score: ')\n",
"print(accuracy_score(actual, predictions))\n",
"print('\\nReport: ')\n",
"print(classification_report(actual, predictions))"
],
"execution_count": 17,
"outputs": [
{
"output_type": "stream",
"text": [
"Seconds for calculation: 0.02467489242553711\n",
"\n",
"\n",
"\n",
"Report for Multinomail Naive Bayes\n",
"Number of mislabeled points out of a total 14500 points: 2134\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 4 5\n",
"Actual \n",
"1 11478 0 0\n",
"2 7 6 0\n",
"3 23 12 4\n",
"4 768 291 1096\n",
"5 1 211 597\n",
"6 0 3 1\n",
"7 2 0 0\n",
"\n",
"Accuracy Score: \n",
"0.8528275862068966\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.93 1.00 0.97 11478\n",
" 2 0.00 0.00 0.00 13\n",
" 3 0.00 0.00 0.00 39\n",
" 4 0.56 0.14 0.22 2155\n",
" 5 0.35 0.74 0.48 809\n",
" 6 0.00 0.00 0.00 4\n",
" 7 0.00 0.00 0.00 2\n",
"\n",
" accuracy 0.85 14500\n",
" macro avg 0.26 0.27 0.24 14500\n",
"weighted avg 0.84 0.85 0.82 14500\n",
"\n"
],
"name": "stdout"
},
{
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
" _warn_prf(average, modifier, msg_start, len(result))\n"
],
"name": "stderr"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "upC14jgDFhsa",
"colab_type": "text"
},
"source": [
"#### Bernoulli Naive Bayes for Wifi Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "oVS1q5crFggn",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 527
},
"outputId": "d36d7600-2106-4cce-dcdf-c98776bed505"
},
"source": [
"# Set up the data as copies from our references\n",
"wifi_X = wifi_train_X.copy()\n",
"wifi_Y = wifi_train_Y.copy()\n",
"wifi_X_test = wifi_test_X.copy()\n",
"wifi_Y_test = wifi_test_Y.copy()\n",
"\n",
"# Just for kicks, we'll time it\n",
"import time\n",
"start_time = time.time()\n",
"\n",
"# Define our gnb classifier here\n",
"classifier = bnb()\n",
"classifier.fit(wifi_X, wifi_Y)\n",
"\n",
"# We make our predictions against the test data\n",
"predictions = classifier.predict(wifi_X_test)\n",
"actual = wifi_Y_test\n",
"\n",
"# Determine our error rates\n",
"error_rate = ((predictions == actual).value_counts()[False]/actual.count())\n",
"end_time = time.time()\n",
"\n",
"# We print the overall time it takes for our toy optimization\n",
"print(\"Seconds for calculation:\", end_time-start_time)\n",
"\n",
"# We need the number of incorrect values per our tasking\n",
"incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
"# Print our necessary metrics\n",
"print('\\n\\n\\nReport for Gaussian Naive Bayes')\n",
"print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
"print('\\nConfusion Matrix')\n",
"print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
"print('\\nAccuracy Score: ')\n",
"print(accuracy_score(actual, predictions))\n",
"print('\\nReport: ')\n",
"print(classification_report(actual, predictions))"
],
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"text": [
"Seconds for calculation: 0.006724357604980469\n",
"\n",
"\n",
"\n",
"Report for Gaussian Naive Bayes\n",
"Number of mislabeled points out of a total 500 points: 55\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 107 0 0 3\n",
"2 0 121 17 0\n",
"3 9 16 88 8\n",
"4 2 0 0 129\n",
"\n",
"Accuracy Score: \n",
"0.89\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.91 0.97 0.94 110\n",
" 2 0.88 0.88 0.88 138\n",
" 3 0.84 0.73 0.78 121\n",
" 4 0.92 0.98 0.95 131\n",
"\n",
" accuracy 0.89 500\n",
" macro avg 0.89 0.89 0.89 500\n",
"weighted avg 0.89 0.89 0.89 500\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZL6yMk-lFtok",
"colab_type": "text"
},
"source": [
"#### Results"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bfv_F-BIFvEV",
"colab_type": "text"
},
"source": [
"Considering that aren't that many tunable parameters in Naïve Bayes algorithms, the best we could do to explore the classifier is to change the algorithm utilized from the Gaussian NB to the Bernoulli NB (the scaling of the data was not conducive to other Bayesian methods like Multinominal or Complement NB).\n",
"\n",
"As demonstrated, the shuttle dataset does not seem to be a good candidate for the Naïve Bayes classifier. The wifi dataset, however, does seem to be a good candidate as it had an only slightly higher error rate than the KNN optimization we did, and the model building for this Naive Bayes classifier took much less time than searching for the overall optimized KNN that would work best."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QCroAY09Gti3",
"colab_type": "text"
},
"source": [
"### Decision Tree Classifiers"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "S-A7JnFAG2AE",
"colab_type": "text"
},
"source": [
"We have been tasked with assessing decision tree classifiers and exploring various hyperparameters. Decision trees have several hyperparameters that can be tuned including depth, criterion, etc. We will construct an optimization search that will, hopefully, find the one that generalizes the best to the test data. There's a lot to see here but for the purposes of this exploration we're going to focus on modifying the `criterion`, the `splitter`, and we'll vary the `max_depth` across a nominal number of values."
]
},
{
"cell_type": "code",
"metadata": {
"id": "djVrsV-yKSIB",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.tree import DecisionTreeClassifier as dtc"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "7aBIzBY-KATZ",
"colab_type": "text"
},
"source": [
"#### Decision Tree for Shuttle Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Vm5VEkXZGs64",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "555d4260-ccf9-42c9-b0fb-2828fbd03bac"
},
"source": [
"# Set up the data as copies from our references\n",
"shuttle_X = shuttle_train_X.copy()\n",
"shuttle_Y = shuttle_train_Y.copy()\n",
"shuttle_X_test = shuttle_test_X.copy()\n",
"shuttle_Y_test = shuttle_test_Y.copy()\n",
"\n",
"# We will create a dataframe to store the parameters and results of our experiments\n",
"results_shuttle = pd.DataFrame(data=None, columns=[\"max_depth\", \"splitter\", \"criterion\",\"error_rate\"])\n",
"\n",
"# Just for kicks, we'll time it\n",
"import time\n",
"start_time = time.time()\n",
"\n",
"# We will be iterating from\n",
"for crit in [\"gini\", \"entropy\"]:\n",
" for split in [\"best\", \"random\"]:\n",
" for i in [3, 6, 9, 12, 15, 18, 21, 24, 27, 30, None]:\n",
"\n",
" # Define our knn classifier here\n",
" classifier = dtc(criterion=crit, splitter=split, max_depth=i)\n",
" classifier.fit(shuttle_X, shuttle_Y)\n",
"\n",
" # We make our predictions against the test data\n",
" predictions = classifier.predict(shuttle_X_test)\n",
" actual = shuttle_Y_test\n",
"\n",
" # Determine our error rates\n",
" error_rate = ((predictions == actual).value_counts()[False]/actual.count())\n",
" results_shuttle=results_shuttle.append({'criterion': crit, 'splitter': split, 'max_depth': i, 'error_rate': error_rate}, ignore_index=True)\n",
"end_time = time.time()\n",
"\n",
"# We print the overall time it takes for our toy optimization\n",
"print(\"Seconds for calculation:\", end_time-start_time)"
],
"execution_count": 20,
"outputs": [
{
"output_type": "stream",
"text": [
"Seconds for calculation: 2.4677255153656006\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Nvl6wzJRKf7s",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"outputId": "448db8e6-c72d-4c05-b8c7-5e5573db1b20"
},
"source": [
"top_3 = results_shuttle.sort_values(by='error_rate').head(3)\n",
"top_3"
],
"execution_count": 21,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>max_depth</th>\n",
" <th>splitter</th>\n",
" <th>criterion</th>\n",
" <th>error_rate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>12</td>\n",
" <td>best</td>\n",
" <td>entropy</td>\n",
" <td>0.000069</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>18</td>\n",
" <td>best</td>\n",
" <td>entropy</td>\n",
" <td>0.000069</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>27</td>\n",
" <td>best</td>\n",
" <td>entropy</td>\n",
" <td>0.000138</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" max_depth splitter criterion error_rate\n",
"25 12 best entropy 0.000069\n",
"27 18 best entropy 0.000069\n",
"30 27 best entropy 0.000138"
]
},
"metadata": {
"tags": []
},
"execution_count": 21
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "91oj3SzZnvVc",
"colab_type": "text"
},
"source": [
"##### Error Rates"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Nf2Pnz6yKh-8",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "b4e2aa4d-020c-4bac-9550-d92b98d5432f"
},
"source": [
"for index, row in top_3.iterrows():\n",
" # Redo our calculations but only for these 3\n",
"\n",
" # Define our knn classifier here\n",
" classifier = dtc(criterion=row['criterion'], splitter=row['splitter'], max_depth=row['max_depth'])\n",
" classifier.fit(shuttle_X, shuttle_Y)\n",
"\n",
" # We make our predictions against the test data\n",
" predictions = classifier.predict(shuttle_X_test)\n",
" actual = shuttle_Y_test\n",
"\n",
" # We need the number of incorrect values per our tasking\n",
" incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
"#columns=[\"max_depth\", \"splitter\", \"criterion\",\"error_rate\"])\n",
" # Print our necessary metrics\n",
" print('\\n\\n\\nReport for criterion= ', row['criterion'], \" splitter=\", row['splitter'], \" max_depth=\", row['max_depth'])\n",
" print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
" print('\\nConfusion Matrix')\n",
" print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
" print('\\nAccuracy Score: ')\n",
" print(accuracy_score(actual, predictions))\n",
" print('\\nReport: ')\n",
" print(classification_report(actual, predictions))"
],
"execution_count": 22,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"\n",
"\n",
"Report for criterion= entropy splitter= best max_depth= 12\n",
"Number of mislabeled points out of a total 14500 points: 2\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4 5 6 7\n",
"Actual \n",
"1 11478 0 0 0 0 0 0\n",
"2 0 12 0 1 0 0 0\n",
"3 0 0 39 0 0 0 0\n",
"4 0 0 0 2155 0 0 0\n",
"5 1 0 0 0 808 0 0\n",
"6 0 0 0 0 0 4 0\n",
"7 0 0 0 0 0 0 2\n",
"\n",
"Accuracy Score: \n",
"0.9998620689655172\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 1.00 1.00 1.00 11478\n",
" 2 1.00 0.92 0.96 13\n",
" 3 1.00 1.00 1.00 39\n",
" 4 1.00 1.00 1.00 2155\n",
" 5 1.00 1.00 1.00 809\n",
" 6 1.00 1.00 1.00 4\n",
" 7 1.00 1.00 1.00 2\n",
"\n",
" accuracy 1.00 14500\n",
" macro avg 1.00 0.99 0.99 14500\n",
"weighted avg 1.00 1.00 1.00 14500\n",
"\n",
"\n",
"\n",
"\n",
"Report for criterion= entropy splitter= best max_depth= 18\n",
"Number of mislabeled points out of a total 14500 points: 2\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4 5 6 7\n",
"Actual \n",
"1 11478 0 0 0 0 0 0\n",
"2 0 12 0 1 0 0 0\n",
"3 0 0 39 0 0 0 0\n",
"4 0 0 0 2155 0 0 0\n",
"5 1 0 0 0 808 0 0\n",
"6 0 0 0 0 0 4 0\n",
"7 0 0 0 0 0 0 2\n",
"\n",
"Accuracy Score: \n",
"0.9998620689655172\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 1.00 1.00 1.00 11478\n",
" 2 1.00 0.92 0.96 13\n",
" 3 1.00 1.00 1.00 39\n",
" 4 1.00 1.00 1.00 2155\n",
" 5 1.00 1.00 1.00 809\n",
" 6 1.00 1.00 1.00 4\n",
" 7 1.00 1.00 1.00 2\n",
"\n",
" accuracy 1.00 14500\n",
" macro avg 1.00 0.99 0.99 14500\n",
"weighted avg 1.00 1.00 1.00 14500\n",
"\n",
"\n",
"\n",
"\n",
"Report for criterion= entropy splitter= best max_depth= 27\n",
"Number of mislabeled points out of a total 14500 points: 1\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4 5 6 7\n",
"Actual \n",
"1 11478 0 0 0 0 0 0\n",
"2 0 12 0 1 0 0 0\n",
"3 0 0 39 0 0 0 0\n",
"4 0 0 0 2155 0 0 0\n",
"5 0 0 0 0 809 0 0\n",
"6 0 0 0 0 0 4 0\n",
"7 0 0 0 0 0 0 2\n",
"\n",
"Accuracy Score: \n",
"0.9999310344827587\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 1.00 1.00 1.00 11478\n",
" 2 1.00 0.92 0.96 13\n",
" 3 1.00 1.00 1.00 39\n",
" 4 1.00 1.00 1.00 2155\n",
" 5 1.00 1.00 1.00 809\n",
" 6 1.00 1.00 1.00 4\n",
" 7 1.00 1.00 1.00 2\n",
"\n",
" accuracy 1.00 14500\n",
" macro avg 1.00 0.99 0.99 14500\n",
"weighted avg 1.00 1.00 1.00 14500\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ABlLM9adNeOy",
"colab_type": "text"
},
"source": [
"#### Decision Tree for Wifi Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Yul2bqgYNmMw",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "26b5ceee-5c3b-4785-c196-15c17d7c406f"
},
"source": [
"# Set up the data as copies from our references\n",
"wifi_X = wifi_train_X.copy()\n",
"wifi_Y = wifi_train_Y.copy()\n",
"wifi_X_test = wifi_test_X.copy()\n",
"wifi_Y_test = wifi_test_Y.copy()\n",
"\n",
"# We will create a dataframe to store the parameters and results of our experiments\n",
"results_wifi = pd.DataFrame(data=None, columns=[\"max_depth\", \"splitter\", \"criterion\",\"error_rate\"])\n",
"\n",
"# Just for kicks, we'll time it\n",
"import time\n",
"start_time = time.time()\n",
"\n",
"# We will be iterating from\n",
"for crit in [\"gini\", \"entropy\"]:\n",
" for split in [\"best\", \"random\"]:\n",
" for i in [3, 6, 9, 12, 15, 18, 21, 24, 27, 30, None]:\n",
"\n",
" # Define our knn classifier here\n",
" classifier = dtc(criterion=crit, splitter=split, max_depth=i)\n",
" classifier.fit(wifi_X, wifi_Y)\n",
"\n",
" # We make our predictions against the test data\n",
" predictions = classifier.predict(wifi_X_test)\n",
" actual = wifi_Y_test\n",
"\n",
" # Determine our error rates\n",
" error_rate = ((predictions == actual).value_counts()[False]/actual.count())\n",
" results_wifi=results_wifi.append({'criterion': crit, 'splitter': split, 'max_depth': i, 'error_rate': error_rate}, ignore_index=True)\n",
"end_time = time.time()\n",
"\n",
"# We print the overall time it takes for our toy optimization\n",
"print(\"Seconds for calculation:\", end_time-start_time)"
],
"execution_count": 23,
"outputs": [
{
"output_type": "stream",
"text": [
"Seconds for calculation: 0.3571510314941406\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "0wcHIxuUOHJy",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"outputId": "1f6fd4c1-41bf-4bd8-9d21-87430a119817"
},
"source": [
"top_3 = results_wifi.sort_values(by='error_rate').head(3)\n",
"top_3"
],
"execution_count": 24,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>max_depth</th>\n",
" <th>splitter</th>\n",
" <th>criterion</th>\n",
" <th>error_rate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>30</td>\n",
" <td>best</td>\n",
" <td>entropy</td>\n",
" <td>0.026</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>24</td>\n",
" <td>best</td>\n",
" <td>entropy</td>\n",
" <td>0.026</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>18</td>\n",
" <td>best</td>\n",
" <td>entropy</td>\n",
" <td>0.026</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" max_depth splitter criterion error_rate\n",
"31 30 best entropy 0.026\n",
"29 24 best entropy 0.026\n",
"27 18 best entropy 0.026"
]
},
"metadata": {
"tags": []
},
"execution_count": 24
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uoJlfP2aoBfr",
"colab_type": "text"
},
"source": [
"##### Error Rates"
]
},
{
"cell_type": "code",
"metadata": {
"id": "KtTt2f1nOMQ8",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "bc1e719f-8aa3-4f59-c5d8-7e1c803165ab"
},
"source": [
"for index, row in top_3.iterrows():\n",
" # Redo our calculations but only for these 3\n",
"\n",
" # Define our wifi classifier here\n",
" classifier = dtc(criterion=row['criterion'], splitter=row['splitter'], max_depth=row['max_depth'])\n",
" classifier.fit(wifi_X, wifi_Y)\n",
"\n",
" # We make our predictions against the test data\n",
" predictions = classifier.predict(wifi_X_test)\n",
" actual = wifi_Y_test\n",
"\n",
" # We need the number of incorrect values per our tasking\n",
" incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
"#columns=[\"max_depth\", \"splitter\", \"criterion\",\"error_rate\"])\n",
" # Print our necessary metrics\n",
" print('\\n\\n\\nReport for criterion= ', row['criterion'], \" splitter=\", row['splitter'], \" max_depth=\", row['max_depth'])\n",
" print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
" print('\\nConfusion Matrix')\n",
" print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
" print('\\nAccuracy Score: ')\n",
" print(accuracy_score(actual, predictions))\n",
" print('\\nReport: ')\n",
" print(classification_report(actual, predictions))"
],
"execution_count": 25,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"\n",
"\n",
"Report for criterion= entropy splitter= best max_depth= 30\n",
"Number of mislabeled points out of a total 500 points: 13\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 109 0 1 0\n",
"2 0 131 7 0\n",
"3 2 2 117 0\n",
"4 1 0 0 130\n",
"\n",
"Accuracy Score: \n",
"0.974\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.97 0.99 0.98 110\n",
" 2 0.98 0.95 0.97 138\n",
" 3 0.94 0.97 0.95 121\n",
" 4 1.00 0.99 1.00 131\n",
"\n",
" accuracy 0.97 500\n",
" macro avg 0.97 0.97 0.97 500\n",
"weighted avg 0.97 0.97 0.97 500\n",
"\n",
"\n",
"\n",
"\n",
"Report for criterion= entropy splitter= best max_depth= 24\n",
"Number of mislabeled points out of a total 500 points: 15\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 108 0 1 1\n",
"2 0 130 8 0\n",
"3 2 2 117 0\n",
"4 1 0 0 130\n",
"\n",
"Accuracy Score: \n",
"0.97\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.97 0.98 0.98 110\n",
" 2 0.98 0.94 0.96 138\n",
" 3 0.93 0.97 0.95 121\n",
" 4 0.99 0.99 0.99 131\n",
"\n",
" accuracy 0.97 500\n",
" macro avg 0.97 0.97 0.97 500\n",
"weighted avg 0.97 0.97 0.97 500\n",
"\n",
"\n",
"\n",
"\n",
"Report for criterion= entropy splitter= best max_depth= 18\n",
"Number of mislabeled points out of a total 500 points: 14\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 108 0 1 1\n",
"2 0 131 7 0\n",
"3 2 2 117 0\n",
"4 1 0 0 130\n",
"\n",
"Accuracy Score: \n",
"0.972\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.97 0.98 0.98 110\n",
" 2 0.98 0.95 0.97 138\n",
" 3 0.94 0.97 0.95 121\n",
" 4 0.99 0.99 0.99 131\n",
"\n",
" accuracy 0.97 500\n",
" macro avg 0.97 0.97 0.97 500\n",
"weighted avg 0.97 0.97 0.97 500\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MOoyr9NdRe2b",
"colab_type": "text"
},
"source": [
"#### Results"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Emv8zP0sRgxX",
"colab_type": "text"
},
"source": [
"The decision tree classifier turned out to be an excellent classifer for both datasets if a decent number of depths for the tree were explored. It turns out that allowing the tree to simply unfold to the maximum depth possible, as encapsulated in the `None` option, does not result in the most optimized classifier when generalized to a testing dataset."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "15MhbBa5R1Mg",
"colab_type": "text"
},
"source": [
"### Ensemble Model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rBXd9hk-R3-g",
"colab_type": "text"
},
"source": [
"We have been tasked with performing a similar assessment for a \"voting\" algorithms that is an ensemble of the three models we have built and explored previously. We will do this by using the hyperparameters for the models that have resulted in the optimal generalization against the test dataset and then returning the classes from each of those models into a Dataframe. \n",
"\n",
"The result of the prediction of this ensemble will effectively be a uniform \"vote\" across all three models by selecting the mode. In the case that there isn't a mode across the three predictions, the prediction from the model that resulted in the lowest error rate will be chosen by default. For purposes of transparency, this result will be flagged within the DataFrame with a noticeable flag to measure other statistics."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "msfRBPVaUfWC",
"colab_type": "text"
},
"source": [
"Luckily, we have done all the work exploring the optimized hyperparameters so we can actually make all of our new classifiers and predictions in one go!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xUlR_Y6MgEat",
"colab_type": "text"
},
"source": [
"#### Ensemble Performance for Shuttle Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "vTMbytxRUnra",
"colab_type": "code",
"colab": {}
},
"source": [
"# For Shuttle Data\n",
"shuttle_ensemble = pd.DataFrame()\n",
"# knn\n",
"shuttle_knn_classifier = knn(n_neighbors=4, weights='distance', p=2)\n",
"shuttle_ensemble['knn'] = shuttle_knn_classifier.fit(shuttle_X, shuttle_Y).predict(shuttle_X_test)\n",
"# bayes\n",
"shuttle_gnb_classifier = gnb()\n",
"shuttle_ensemble['gnb'] = shuttle_knn_classifier.fit(shuttle_X, shuttle_Y).predict(shuttle_X_test)\n",
"# dtc\n",
"shuttle_dtc_classifier = dtc(criterion=\"entropy\", splitter=\"best\", max_depth=15)\n",
"shuttle_ensemble['dtc'] = shuttle_dtc_classifier.fit(shuttle_X, shuttle_Y).predict(shuttle_X_test)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "DQ7-g44mhnAC",
"colab_type": "text"
},
"source": [
"There could be an issue if there are instances where a mode is not found for a \"vote.\" We will attempt a first run using panda's simple mode function and explore from there."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Utizo_2qe1TY",
"colab_type": "code",
"colab": {}
},
"source": [
"shuttle_ensemble['vote'] = shuttle_ensemble.mode(axis=1, dropna=False)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "PsraR8cZh3mp",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 51
},
"outputId": "e235bafe-4b8a-410f-9519-4c1192fb445e"
},
"source": [
"# Print the unique values to see whether there is a NaN or null\n",
"print(shuttle_ensemble['vote'].unique())\n",
"# Print the number of predictions minus the number of test samples\n",
"print(shuttle_ensemble['vote'].count() - shuttle_X_test['x8'].count(), \"out of\", shuttle_X_test['x8'].count())"
],
"execution_count": 30,
"outputs": [
{
"output_type": "stream",
"text": [
"[4 1 5 3 2 7 6]\n",
"0 out of 14500\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aRVytoXqisBA",
"colab_type": "text"
},
"source": [
"Amazingly, there are no instances without a mode: in every row at least two of the three predictions agree, so the vote is defined for all 14,500 samples in the test set.\n",
"\n",
"Now we will use our previous metrics to measure the efficacy of this ensemble."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r8ceOx1roHf1",
"colab_type": "text"
},
"source": [
"##### Error Rates"
]
},
{
"cell_type": "code",
"metadata": {
"id": "bwwfTogNj-N4",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 544
},
"outputId": "cc5f5e55-80bc-4c47-c981-a136cb821843"
},
"source": [
"# We make our predictions against the test data\n",
"predictions = shuttle_ensemble['vote']\n",
"actual = shuttle_Y_test\n",
"\n",
"# We need the number of incorrect values per our tasking\n",
"incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
"# Print our necessary metrics\n",
"print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
"print('\\nConfusion Matrix')\n",
"print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
"print('\\nAccuracy Score: ')\n",
"print(accuracy_score(actual, predictions))\n",
"print('\\nReport: ')\n",
"print(classification_report(actual, predictions))"
],
"execution_count": 31,
"outputs": [
{
"output_type": "stream",
"text": [
"Number of mislabeled points out of a total 14500 points: 16\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4 5 6 7\n",
"Actual \n",
"1 11472 0 1 1 2 0 2\n",
"2 0 12 0 1 0 0 0\n",
"3 5 0 34 0 0 0 0\n",
"4 0 0 0 2155 0 0 0\n",
"5 1 0 0 1 807 0 0\n",
"6 0 0 0 0 1 3 0\n",
"7 1 0 0 0 0 0 1\n",
"\n",
"Accuracy Score: \n",
"0.9988965517241379\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 1.00 1.00 1.00 11478\n",
" 2 1.00 0.92 0.96 13\n",
" 3 0.97 0.87 0.92 39\n",
" 4 1.00 1.00 1.00 2155\n",
" 5 1.00 1.00 1.00 809\n",
" 6 1.00 0.75 0.86 4\n",
" 7 0.33 0.50 0.40 2\n",
"\n",
" accuracy 1.00 14500\n",
" macro avg 0.90 0.86 0.88 14500\n",
"weighted avg 1.00 1.00 1.00 14500\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pZ0QUrrygZn8",
"colab_type": "text"
},
"source": [
"#### Ensemble Performance for Wifi Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "uDI5LILacyyk",
"colab_type": "code",
"colab": {}
},
"source": [
"# For Wifi Data\n",
"wifi_ensemble = pd.DataFrame()\n",
"# knn\n",
"wifi_knn_classifier = knn(n_neighbors=4, weights='distance', p=2)\n",
"wifi_ensemble['knn'] = wifi_knn_classifier.fit(wifi_X, wifi_Y).predict(wifi_X_test)\n",
"# bayes\n",
"wifi_gnb_classifier = gnb()\n",
"wifi_ensemble['gnb'] = wifi_gnb_classifier.fit(wifi_X, wifi_Y).predict(wifi_X_test)\n",
"# dtc\n",
"wifi_dtc_classifier = dtc(criterion=\"entropy\", splitter=\"best\", max_depth=15)\n",
"wifi_ensemble['dtc'] = wifi_dtc_classifier.fit(wifi_X, wifi_Y).predict(wifi_X_test)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "yFVKsLSSmHYn",
"colab_type": "code",
"colab": {}
},
"source": [
"wifi_ensemble['vote'] = wifi_ensemble.mode(axis=1, dropna=False)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "yqMlAYKemLi-",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 51
},
"outputId": "0533d9c3-3175-4cbe-d7ff-8fe200d0f428"
},
"source": [
"# Print the unique values to see whether there is a NaN or null\n",
"print(wifi_ensemble['vote'].unique())\n",
"# Print the number of predictions minus the number of test samples\n",
"print(wifi_ensemble['vote'].count() - wifi_X_test['r1'].count(), \"out of\", wifi_X_test['r1'].count())"
],
"execution_count": 34,
"outputs": [
{
"output_type": "stream",
"text": [
"[4 2 3 1]\n",
"0 out of 500\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wbN8HpjHmaAZ",
"colab_type": "text"
},
"source": [
"Amazingly, again there is no issue with NaN or Null values."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wIThz-FgvY_p",
"colab_type": "text"
},
"source": [
"##### Error Rates"
]
},
{
"cell_type": "code",
"metadata": {
"id": "w5wf6_bImQVc",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 442
},
"outputId": "4917efb2-09f9-4b8e-bcf5-6ec2264105cc"
},
"source": [
"# We make our predictions against the test data\n",
"predictions = wifi_ensemble['vote']\n",
"actual = wifi_Y_test\n",
"\n",
"# We need the number of incorrect values per our tasking\n",
"incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
"# Print our necessary metrics\n",
"print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
"print('\\nConfusion Matrix')\n",
"print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
"print('\\nAccuracy Score: ')\n",
"print(accuracy_score(actual, predictions))\n",
"print('\\nReport: ')\n",
"print(classification_report(actual, predictions))"
],
"execution_count": 35,
"outputs": [
{
"output_type": "stream",
"text": [
"Number of mislabeled points out of a total 500 points: 8\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 109 0 1 0\n",
"2 0 133 5 0\n",
"3 1 0 120 0\n",
"4 1 0 0 130\n",
"\n",
"Accuracy Score: \n",
"0.984\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.98 0.99 0.99 110\n",
" 2 1.00 0.96 0.98 138\n",
" 3 0.95 0.99 0.97 121\n",
" 4 1.00 0.99 1.00 131\n",
"\n",
" accuracy 0.98 500\n",
" macro avg 0.98 0.98 0.98 500\n",
"weighted avg 0.98 0.98 0.98 500\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qJknLfp7onqN",
"colab_type": "text"
},
"source": [
"## Part 1 (Discussion)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2XJbftAhpBU5",
"colab_type": "text"
},
"source": [
"So what has our exploration in optimization yielded? Let's take a look at the accuracy scores derived from each classifier:\n",
"```markdown\n",
"# Accuracy Scores for Shuttle\n",
"- knn: 0.99889\n",
"- bayes: 0.915\n",
"- dtc: 0.99993\n",
"- ensemble: 0.99889\n",
"\n",
"# Accuracy Scores for Wifi\n",
"- knn: 0.984\n",
"- bayes: 0.98\n",
"- dtc: 0.974\n",
"- ensemble: 0.984\n",
"```\n",
"As demonstrated, the ensemble vote yields the same accuracy as the K Nearest Neighbor predictions in both instances. While this matched the \"best\" model for the Wifi test scores, the ensemble scored slightly worse on the Shuttle test scores than a decision tree discovered through intensive optimization.\n",
"\n",
"So what does this mean for the efficacy of ensembles? \n",
"\n",
"Of course, these experiments cover only two datasets, and each dataset is only a snapshot of a domain that is likely ever-changing, so they are hardly characteristic of the efficacy of ensembles across all possible domains where Machine Learning can be applied.\n",
"\n",
"What an ensemble model likely provides is resiliency and increased confidence. If a dataset works well with a KNN today but works better with a Decision Tree tomorrow, then by combining our predictors in this fashion, much like 'bagging', we can ensure that a model that breaks one day from outlier events has an ensemble of parallel models to fall back on, and a \"vote\" between them to account for models that may produce outliers and noise.\n",
"\n",
"In addition, my optimization attempts here are effectively toy brute-force searches across all combinations of a small subset of hyperparameters. These optimizations would take far more time on longer and wider datasets, or if more hyperparameters were considered. I'll also be transparent that I seem to have gotten lucky: these datasets appear to have very clear distinctions between features that define the classes. In more difficult and larger \"Big Data\" situations, a parallel ensemble that votes on or takes a weighted average of the outcomes likely offers a far better tradeoff between optimization effort and compute efficiency than any one model can provide. This would also apply to ensembles of models that use the same algorithm with differently tuned hyperparameters."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TMVgR_8PxEBV",
"colab_type": "text"
},
"source": [
"## Part 2 (Task)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7cbrc6sNxI8d",
"colab_type": "text"
},
"source": [
"We are to perform the same tasks as Part 1 but applied to a Random Forest Classifier. We will explore and experiment with variations to the RFC to find an optimal model and discuss our conclusions about ensemble models."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VNPg_rplxf5l",
"colab_type": "text"
},
"source": [
"### Random Forest Classifier"
]
},
{
"cell_type": "code",
"metadata": {
"id": "gYA1TUgnxFWR",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.ensemble import RandomForestClassifier as rfc "
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "UH4b0V5vzyMz",
"colab_type": "text"
},
"source": [
"There are a few hyperparameters a Random Forest Classifier can tune. In this case we'll focus on a subset of them:\n",
"- `n_estimators`: the number of trees in the forest; we'll try values from 50 to 500 in steps of 50\n",
"- `criterion`: the measure of split quality; we'll try both `gini` and `entropy`\n",
"\n",
"Considering the amount of training time this should take, we'll vary only these two hyperparameters for now."
]
},
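{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"The search space is small enough to enumerate explicitly. A minimal sketch, assuming the grid used in the cells below (`n_estimators` from 50 to 500 in steps of 50, combined with both split criteria); the variable names are illustrative only:\n",
"```python\n",
"from itertools import product\n",
"\n",
"n_estimators_grid = range(50, 550, 50)  # 50, 100, ..., 500\n",
"criterion_grid = ['gini', 'entropy']\n",
"\n",
"# Every (n_estimators, criterion) pair, in the order a nested loop visits them\n",
"grid = list(product(n_estimators_grid, criterion_grid))\n",
"print(len(grid))\n",
"print(grid[0], grid[17])\n",
"```\n",
"Because the pairs follow the nested-loop order, a row index in the results table can be mapped back to its hyperparameters."
]
},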
{
"cell_type": "markdown",
"metadata": {
"id": "ybkhADnn346m",
"colab_type": "text"
},
"source": [
"#### Random Forest Classifier for Shuttle Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "DJoxJqs4zv3m",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "4cd89936-aee3-4b7b-9d23-36c8e66915e0"
},
"source": [
"# Set up the data as copies from our references\n",
"shuttle_X = shuttle_train_X.copy()\n",
"shuttle_Y = shuttle_train_Y.copy()\n",
"shuttle_X_test = shuttle_test_X.copy()\n",
"shuttle_Y_test = shuttle_test_Y.copy()\n",
"\n",
"# We will create a dataframe to store the parameters and results of our experiments\n",
"results_shuttle = pd.DataFrame(data=None, columns=[\"n_estimators\", \"criterion\", \"error_rate\"])\n",
"\n",
"# Just for kicks, we'll time it\n",
"import time\n",
"start_time = time.time()\n",
"\n",
"# Iterate over forest sizes from 50 to 500 in steps of 50, and over both split criteria\n",
"for n in range(50, 550, 50):\n",
"  for crit in [\"gini\", \"entropy\"]:\n",
"    # Define our random forest classifier here\n",
"    classifier = rfc(n_estimators=n, criterion=crit)\n",
"    classifier.fit(shuttle_X, shuttle_Y)\n",
"\n",
"    # We make our predictions against the test data\n",
"    predictions = classifier.predict(shuttle_X_test)\n",
"    actual = shuttle_Y_test\n",
"\n",
"    # Determine our error rate (fraction of mislabeled test points)\n",
"    error_rate = (predictions != actual).sum() / actual.count()\n",
"    # Add a row by label; DataFrame.append was removed in pandas 2.0\n",
"    results_shuttle.loc[len(results_shuttle)] = [n, crit, error_rate]\n",
"end_time = time.time()\n",
"\n",
"# We print the overall time it takes for our toy optimization\n",
"print(\"Seconds for calculation:\", end_time-start_time)"
],
"execution_count": 37,
"outputs": [
{
"output_type": "stream",
"text": [
"Seconds for calculation: 132.00880122184753\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bgNj6HR72tdW",
"colab_type": "text"
},
"source": [
"Wow, that took some time. Let's see the results."
]
},
{
"cell_type": "code",
"metadata": {
"id": "nXYeCPhG18Kn",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"outputId": "352d4b1b-53d4-4998-cb63-a21770b63d0b"
},
"source": [
"top_3 = results_shuttle.sort_values(by='error_rate').head(3)\n",
"top_3"
],
"execution_count": 38,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>n_estimators</th>\n",
" <th>criterion</th>\n",
" <th>error_rate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>450</td>\n",
" <td>entropy</td>\n",
" <td>0.000069</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>150</td>\n",
" <td>entropy</td>\n",
" <td>0.000069</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>200</td>\n",
" <td>entropy</td>\n",
" <td>0.000069</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" n_estimators criterion error_rate\n",
"17 450 entropy 0.000069\n",
"5 150 entropy 0.000069\n",
"7 200 entropy 0.000069"
]
},
"metadata": {
"tags": []
},
"execution_count": 38
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_zfyfSW53zax",
"colab_type": "text"
},
"source": [
"##### Error Rates"
]
},
{
"cell_type": "code",
"metadata": {
"id": "wg0JgZlO214p",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "7d771913-cd4f-4aae-ee90-52e148de6e1c"
},
"source": [
"for index, row in top_3.iterrows():\n",
" # Redo our calculations but only for these 3\n",
"\n",
" # Define our random forest classifier from this row's hyperparameters,\n",
" # not from the leftover loop variables n and crit\n",
" classifier = rfc(n_estimators=int(row['n_estimators']), criterion=row['criterion'])\n",
" classifier.fit(shuttle_X, shuttle_Y)\n",
"\n",
" # We make our predictions against the test data\n",
" predictions = classifier.predict(shuttle_X_test)\n",
" actual = shuttle_Y_test\n",
"\n",
" # We need the number of incorrect values per our tasking\n",
" incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
" # Print our necessary metrics\n",
" print('\\n\\n\\nReport for n_estimators= ', row['n_estimators'], \" criterion=\", row['criterion'])\n",
" print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
" print('\\nConfusion Matrix')\n",
" print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
" print('\\nAccuracy Score: ')\n",
" print(accuracy_score(actual, predictions))\n",
" print('\\nReport: ')\n",
" print(classification_report(actual, predictions))"
],
"execution_count": 39,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"\n",
"\n",
"Report for n_estimators= 450 criterion= entropy\n",
"Number of mislabeled points out of a total 14500 points: 2\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4 5 6 7\n",
"Actual \n",
"1 11477 0 1 0 0 0 0\n",
"2 0 12 0 1 0 0 0\n",
"3 0 0 39 0 0 0 0\n",
"4 0 0 0 2155 0 0 0\n",
"5 0 0 0 0 809 0 0\n",
"6 0 0 0 0 0 4 0\n",
"7 0 0 0 0 0 0 2\n",
"\n",
"Accuracy Score: \n",
"0.9998620689655172\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 1.00 1.00 1.00 11478\n",
" 2 1.00 0.92 0.96 13\n",
" 3 0.97 1.00 0.99 39\n",
" 4 1.00 1.00 1.00 2155\n",
" 5 1.00 1.00 1.00 809\n",
" 6 1.00 1.00 1.00 4\n",
" 7 1.00 1.00 1.00 2\n",
"\n",
" accuracy 1.00 14500\n",
" macro avg 1.00 0.99 0.99 14500\n",
"weighted avg 1.00 1.00 1.00 14500\n",
"\n",
"\n",
"\n",
"\n",
"Report for n_estimators= 150 criterion= entropy\n",
"Number of mislabeled points out of a total 14500 points: 2\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4 5 6 7\n",
"Actual \n",
"1 11477 0 1 0 0 0 0\n",
"2 0 12 0 1 0 0 0\n",
"3 0 0 39 0 0 0 0\n",
"4 0 0 0 2155 0 0 0\n",
"5 0 0 0 0 809 0 0\n",
"6 0 0 0 0 0 4 0\n",
"7 0 0 0 0 0 0 2\n",
"\n",
"Accuracy Score: \n",
"0.9998620689655172\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 1.00 1.00 1.00 11478\n",
" 2 1.00 0.92 0.96 13\n",
" 3 0.97 1.00 0.99 39\n",
" 4 1.00 1.00 1.00 2155\n",
" 5 1.00 1.00 1.00 809\n",
" 6 1.00 1.00 1.00 4\n",
" 7 1.00 1.00 1.00 2\n",
"\n",
" accuracy 1.00 14500\n",
" macro avg 1.00 0.99 0.99 14500\n",
"weighted avg 1.00 1.00 1.00 14500\n",
"\n",
"\n",
"\n",
"\n",
"Report for n_estimators= 200 criterion= entropy\n",
"Number of mislabeled points out of a total 14500 points: 2\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4 5 6 7\n",
"Actual \n",
"1 11477 0 1 0 0 0 0\n",
"2 0 12 0 1 0 0 0\n",
"3 0 0 39 0 0 0 0\n",
"4 0 0 0 2155 0 0 0\n",
"5 0 0 0 0 809 0 0\n",
"6 0 0 0 0 0 4 0\n",
"7 0 0 0 0 0 0 2\n",
"\n",
"Accuracy Score: \n",
"0.9998620689655172\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 1.00 1.00 1.00 11478\n",
" 2 1.00 0.92 0.96 13\n",
" 3 0.97 1.00 0.99 39\n",
" 4 1.00 1.00 1.00 2155\n",
" 5 1.00 1.00 1.00 809\n",
" 6 1.00 1.00 1.00 4\n",
" 7 1.00 1.00 1.00 2\n",
"\n",
" accuracy 1.00 14500\n",
" macro avg 1.00 0.99 0.99 14500\n",
"weighted avg 1.00 1.00 1.00 14500\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BfSNnJv-3-y5",
"colab_type": "text"
},
"source": [
"#### Random Forest Classifier for Wifi Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "kgR04Ki63Teg",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "9ea9d4b8-d3d1-48f3-d335-2b47031116f7"
},
"source": [
"# Set up the data as copies from our references\n",
"wifi_X = wifi_train_X.copy()\n",
"wifi_Y = wifi_train_Y.copy()\n",
"wifi_X_test = wifi_test_X.copy()\n",
"wifi_Y_test = wifi_test_Y.copy()\n",
"\n",
"# We will create a dataframe to store the parameters and results of our experiments\n",
"results_wifi = pd.DataFrame(data=None, columns=[\"n_estimators\", \"criterion\", \"error_rate\"])\n",
"\n",
"# Just for kicks, we'll time it\n",
"import time\n",
"start_time = time.time()\n",
"\n",
"# Iterate over forest sizes from 50 to 500 in steps of 50, and over both split criteria\n",
"for n in range(50, 550, 50):\n",
"  for crit in [\"gini\", \"entropy\"]:\n",
"    # Define our random forest classifier here\n",
"    classifier = rfc(n_estimators=n, criterion=crit)\n",
"    classifier.fit(wifi_X, wifi_Y)\n",
"\n",
"    # We make our predictions against the test data\n",
"    predictions = classifier.predict(wifi_X_test)\n",
"    actual = wifi_Y_test\n",
"\n",
"    # Determine our error rate (fraction of mislabeled test points)\n",
"    error_rate = (predictions != actual).sum() / actual.count()\n",
"    # Add a row by label; DataFrame.append was removed in pandas 2.0\n",
"    results_wifi.loc[len(results_wifi)] = [n, crit, error_rate]\n",
"end_time = time.time()\n",
"\n",
"# We print the overall time it takes for our toy optimization\n",
"print(\"Seconds for calculation:\", end_time-start_time)"
],
"execution_count": 40,
"outputs": [
{
"output_type": "stream",
"text": [
"Seconds for calculation: 11.800642251968384\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Nl_JSyjY4bkK",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"outputId": "312bcdbf-630b-48ee-e8f8-70ff1f243f66"
},
"source": [
"top_3 = results_wifi.sort_values(by='error_rate').head(3)\n",
"top_3"
],
"execution_count": 41,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>n_estimators</th>\n",
" <th>criterion</th>\n",
" <th>error_rate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>50</td>\n",
" <td>gini</td>\n",
" <td>0.016</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>400</td>\n",
" <td>entropy</td>\n",
" <td>0.016</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>450</td>\n",
" <td>entropy</td>\n",
" <td>0.018</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" n_estimators criterion error_rate\n",
"0 50 gini 0.016\n",
"15 400 entropy 0.016\n",
"17 450 entropy 0.018"
]
},
"metadata": {
"tags": []
},
"execution_count": 41
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UOQl500B5j7P",
"colab_type": "text"
},
"source": [
"##### Error Rates"
]
},
{
"cell_type": "code",
"metadata": {
"id": "iOPH3vc04mcA",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "65d8fa81-4bd9-4177-f8c2-d403c3900792"
},
"source": [
"for index, row in top_3.iterrows():\n",
" # Redo our calculations but only for these 3\n",
"\n",
" # Define our random forest classifier from this row's hyperparameters,\n",
" # not from the leftover loop variables n and crit\n",
" classifier = rfc(n_estimators=int(row['n_estimators']), criterion=row['criterion'])\n",
" classifier.fit(wifi_X, wifi_Y)\n",
"\n",
" # We make our predictions against the test data\n",
" predictions = classifier.predict(wifi_X_test)\n",
" actual = wifi_Y_test\n",
"\n",
" # We need the number of incorrect values per our tasking\n",
" incorrect = (predictions == actual).value_counts()[False]\n",
"\n",
" # Print our necessary metrics\n",
" print('\\n\\n\\nReport for n_estimators= ', row['n_estimators'], \" criterion=\", row['criterion'])\n",
" print('Number of mislabeled points out of a total ', actual.count(), ' points: ', incorrect)\n",
" print('\\nConfusion Matrix')\n",
" print(pd.crosstab(actual, predictions, rownames=['Actual'], colnames=['Predicted']))\n",
" print('\\nAccuracy Score: ')\n",
" print(accuracy_score(actual, predictions))\n",
" print('\\nReport: ')\n",
" print(classification_report(actual, predictions))"
],
"execution_count": 42,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"\n",
"\n",
"Report for n_estimators= 50 criterion= gini\n",
"Number of mislabeled points out of a total 500 points: 9\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 109 0 1 0\n",
"2 0 133 5 0\n",
"3 2 0 119 0\n",
"4 1 0 0 130\n",
"\n",
"Accuracy Score: \n",
"0.982\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.97 0.99 0.98 110\n",
" 2 1.00 0.96 0.98 138\n",
" 3 0.95 0.98 0.97 121\n",
" 4 1.00 0.99 1.00 131\n",
"\n",
" accuracy 0.98 500\n",
" macro avg 0.98 0.98 0.98 500\n",
"weighted avg 0.98 0.98 0.98 500\n",
"\n",
"\n",
"\n",
"\n",
"Report for n_estimators= 400 criterion= entropy\n",
"Number of mislabeled points out of a total 500 points: 8\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 109 0 1 0\n",
"2 0 134 4 0\n",
"3 2 0 119 0\n",
"4 1 0 0 130\n",
"\n",
"Accuracy Score: \n",
"0.984\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.97 0.99 0.98 110\n",
" 2 1.00 0.97 0.99 138\n",
" 3 0.96 0.98 0.97 121\n",
" 4 1.00 0.99 1.00 131\n",
"\n",
" accuracy 0.98 500\n",
" macro avg 0.98 0.98 0.98 500\n",
"weighted avg 0.98 0.98 0.98 500\n",
"\n",
"\n",
"\n",
"\n",
"Report for n_estimators= 450 criterion= entropy\n",
"Number of mislabeled points out of a total 500 points: 9\n",
"\n",
"Confusion Matrix\n",
"Predicted 1 2 3 4\n",
"Actual \n",
"1 109 0 1 0\n",
"2 0 133 5 0\n",
"3 2 0 119 0\n",
"4 1 0 0 130\n",
"\n",
"Accuracy Score: \n",
"0.982\n",
"\n",
"Report: \n",
" precision recall f1-score support\n",
"\n",
" 1 0.97 0.99 0.98 110\n",
" 2 1.00 0.96 0.98 138\n",
" 3 0.95 0.98 0.97 121\n",
" 4 1.00 0.99 1.00 131\n",
"\n",
" accuracy 0.98 500\n",
" macro avg 0.98 0.98 0.98 500\n",
"weighted avg 0.98 0.98 0.98 500\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EZ65mHrbY77s",
"colab_type": "text"
},
"source": [
"## Part 2 (Discussion)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EV1M-9ONZEs4",
"colab_type": "text"
},
"source": [
"Here we've run a brute-force optimization against a non-exhaustive set of hyperparameters for Random Forest Classifiers, which are an ensemble modeling methodology derived from Decision Tree Classifiers. Based on our experimentation, we've retrieved the following peak accuracy scores:\n",
"```markdown\n",
"# Accuracy Scores for Shuttle\n",
"- knn: 0.99889\n",
"- bayes: 0.915\n",
"- dtc: 0.99993\n",
"- ensemble: 0.99889\n",
"- rfc: 0.99986\n",
"\n",
"# Accuracy Scores for Wifi\n",
"- knn: 0.984\n",
"- bayes: 0.98\n",
"- dtc: 0.974\n",
"- ensemble: 0.984\n",
"- rfc: 0.984\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LwLOSU01brOW",
"colab_type": "text"
},
"source": [
"So for the case of the Shuttle Data, the RFC does slightly worse than the DTC (I would say not significant enough to make a distinct decision on which is \"better\") and for the case of the Wifi data the RFC performs just as well as the KNN and the previous voting ensemble."
]
},
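{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"To put the Shuttle scores in perspective, the rounded accuracies can be converted back to approximate counts of misclassified test samples. A small sketch, assuming the 14,500-sample test set and the rounded scores listed above; the variable names are illustrative:\n",
"```python\n",
"n_test = 14500  # size of the Shuttle test set\n",
"accuracy = {'knn': 0.99889, 'dtc': 0.99993, 'rfc': 0.99986}\n",
"\n",
"# errors = (1 - accuracy) * number of test samples, rounded to whole samples\n",
"errors = {name: round((1 - acc) * n_test) for name, acc in accuracy.items()}\n",
"print(errors)\n",
"```\n",
"By this arithmetic the RFC and the DTC differ by a single test sample, which is why the gap is hard to call meaningful."
]
},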
{
"cell_type": "markdown",
"metadata": {
"id": "-uS-aX-IcDbw",
"colab_type": "text"
},
"source": [
"### Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0UFJ8wPgcHC1",
"colab_type": "text"
},
"source": [
"So what to conclude about the value of ensemble methods? \n",
"\n",
"I'll have to give the cop-out engineering answer of \"it depends.\" Through these exercises we have evaluated several single-model and ensemble methods against relatively small datasets, and found that on pure accuracy scores the ensembles don't seem to be significantly better.\n",
"\n",
"However, the pure accuracy score for a dataset that represents a small snapshot of a series of events is hardly indicative of true, industrial Machine Learning pipelines. In addition, there are several studies in which ensembles derived through rigorous exploration and testing are reported to be more effective than any single model:\n",
"- [Ensemble machine learning on gene expression data for cancer classification](http://bura.brunel.ac.uk/handle/2438/3013)\n",
"- [A novel ensemble machine learning for robust microarray data classification](https://www.sciencedirect.com/science/article/abs/pii/S0010482505000533)\n",
"- [An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins](https://doi.org/10.1093/bioinformatics/btg1027)\n",
"\n",
"Not to mention, the experimentation I've performed here uses no information about the domain that each dataset is derived from. Perhaps domain knowledge would inform how an ensemble should be constructed so that it performs better over the long term.\n",
"\n",
"If there's anything to learn from these exercises, it's that it doesn't hurt to use a set of your data to explore different models/ensembles and run controlled optimizations and experiments. Proceeding with the assumption that \"my model will work in perpetuity\" might allow the underlying service to degrade."
]
},
{
"cell_type": "code",
"metadata": {
"id": "4HtJGsitqwIA",
"colab_type": "code",
"colab": {}
},
"source": [
""
],
"execution_count": 0,
"outputs": []
}
]
}