{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"colab": {
"name": "Machine Learning Workshop.ipynb",
"provenance": [],
"include_colab_link": true
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/Sanket758/e43884f2cebc041f8d01e8e5ba36eed1/machine-learning-workshop.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y8Dgd0Z2uRkJ",
"colab_type": "text"
},
"source": [
"# Introduction to Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SWtodtRquRkK",
"colab_type": "text"
},
"source": [
"Linear regression is one of the most popular machine learning algorithms, and it is worth trying whenever the target column is a continuous value. The value of a home is generally considered continuous: there is technically no limit to how high the price of a home may be, and it can take any value between two numbers, even though prices are often rounded."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "N1HlB7L_uRkN",
"colab_type": "text"
},
"source": [
"### Using Linear Regression to Predict the Accuracy of the Hosing Prices"
]
},
{
"cell_type": "code",
"metadata": {
"id": "FtSugqREuRkN",
"colab_type": "code",
"colab": {}
},
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "feaZzMTzuRkT",
"colab_type": "code",
"colab": {},
"outputId": "ea763326-dfe5-4ebd-84cd-2f3035dd7be0"
},
"source": [
"# load data\n",
"housing_df = pd.read_csv('HousingData.csv')\n",
"housing_df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CRIM</th>\n",
" <th>ZN</th>\n",
" <th>INDUS</th>\n",
" <th>CHAS</th>\n",
" <th>NOX</th>\n",
" <th>RM</th>\n",
" <th>AGE</th>\n",
" <th>DIS</th>\n",
" <th>RAD</th>\n",
" <th>TAX</th>\n",
" <th>PTRATIO</th>\n",
" <th>B</th>\n",
" <th>LSTAT</th>\n",
" <th>MEDV</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.00632</td>\n",
" <td>18.0</td>\n",
" <td>2.31</td>\n",
" <td>0.0</td>\n",
" <td>0.538</td>\n",
" <td>6.575</td>\n",
" <td>65.2</td>\n",
" <td>4.0900</td>\n",
" <td>1</td>\n",
" <td>296</td>\n",
" <td>15.3</td>\n",
" <td>396.90</td>\n",
" <td>4.98</td>\n",
" <td>24.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.02731</td>\n",
" <td>0.0</td>\n",
" <td>7.07</td>\n",
" <td>0.0</td>\n",
" <td>0.469</td>\n",
" <td>6.421</td>\n",
" <td>78.9</td>\n",
" <td>4.9671</td>\n",
" <td>2</td>\n",
" <td>242</td>\n",
" <td>17.8</td>\n",
" <td>396.90</td>\n",
" <td>9.14</td>\n",
" <td>21.6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.02729</td>\n",
" <td>0.0</td>\n",
" <td>7.07</td>\n",
" <td>0.0</td>\n",
" <td>0.469</td>\n",
" <td>7.185</td>\n",
" <td>61.1</td>\n",
" <td>4.9671</td>\n",
" <td>2</td>\n",
" <td>242</td>\n",
" <td>17.8</td>\n",
" <td>392.83</td>\n",
" <td>4.03</td>\n",
" <td>34.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.03237</td>\n",
" <td>0.0</td>\n",
" <td>2.18</td>\n",
" <td>0.0</td>\n",
" <td>0.458</td>\n",
" <td>6.998</td>\n",
" <td>45.8</td>\n",
" <td>6.0622</td>\n",
" <td>3</td>\n",
" <td>222</td>\n",
" <td>18.7</td>\n",
" <td>394.63</td>\n",
" <td>2.94</td>\n",
" <td>33.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.06905</td>\n",
" <td>0.0</td>\n",
" <td>2.18</td>\n",
" <td>0.0</td>\n",
" <td>0.458</td>\n",
" <td>7.147</td>\n",
" <td>54.2</td>\n",
" <td>6.0622</td>\n",
" <td>3</td>\n",
" <td>222</td>\n",
" <td>18.7</td>\n",
" <td>396.90</td>\n",
" <td>NaN</td>\n",
" <td>36.2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n",
"0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 \n",
"1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 \n",
"2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 \n",
"3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 \n",
"4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3 222 18.7 \n",
"\n",
" B LSTAT MEDV \n",
"0 396.90 4.98 24.0 \n",
"1 396.90 9.14 21.6 \n",
"2 392.83 4.03 34.7 \n",
"3 394.63 2.94 33.4 \n",
"4 396.90 NaN 36.2 "
]
},
"metadata": {
"tags": []
},
"execution_count": 2
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "QMddYy2HuRkY",
"colab_type": "code",
"colab": {}
},
"source": [
"# drop null values\n",
"housing_df = housing_df.dropna()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "_J-bAlP_uRkc",
"colab_type": "code",
"colab": {}
},
"source": [
"# declare X and y\n",
"X = housing_df.iloc[:,:-1]\n",
"y = housing_df.iloc[:, -1]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "4zMDRCKNuRkg",
"colab_type": "code",
"colab": {}
},
"source": [
"#Create training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "CScPOtfLuRkk",
"colab_type": "code",
"colab": {}
},
"source": [
"#Create the regressor: reg\n",
"reg = LinearRegression()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Pu5S_37uuRko",
"colab_type": "code",
"colab": {},
"outputId": "c7c60140-8caf-4e96-e073-0f3fa194781b"
},
"source": [
"#Fit the regressor to the training data\n",
"reg.fit(X_train, y_train)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
]
},
"metadata": {
"tags": []
},
"execution_count": 7
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "bw59KWZLuRks",
"colab_type": "code",
"colab": {}
},
"source": [
"# Predict on the test data: y_pred\n",
"y_pred = reg.predict(X_test)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "ZoxRbdQQuRky",
"colab_type": "code",
"colab": {},
"outputId": "e327a7ac-8408-46b4-8054-79a76feea162"
},
"source": [
"# Compute and print RMSE\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Root Mean Squared Error: 3.331279959482406\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jEbl1xf5uRk1",
"colab_type": "text"
},
"source": [
"# Cross Validation\n",
"In cross-validation, also known as CV, the training data is split into five folds (any number will do, but five is standard). The machine learning algorithm is fit on one fold at a time and tested on the remaining data. The result is five different training and test sets that are all representative of the same data. The mean of the scores is usually taken as the accuracy of the model."
]
},
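{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of what happens under the hood (assuming the housing `X` defined above), sklearn's KFold shows the five train/test splits that cross-validation cycles through:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from sklearn.model_selection import KFold\n",
"\n",
"# Each iteration holds out one fold for testing and trains on the other four\n",
"kf = KFold(n_splits=5)\n",
"for fold, (train_idx, test_idx) in enumerate(kf.split(X)):\n",
"    print('Fold {}: train size = {}, test size = {}'.format(fold, len(train_idx), len(test_idx)))"
],
"execution_count": 0,
"outputs": []
},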
{
"cell_type": "markdown",
"metadata": {
"id": "2YB8bascuRk2",
"colab_type": "text"
},
"source": [
"### Using the cross_val_score Function to Get Accurate Results on the Dataset "
]
},
{
"cell_type": "code",
"metadata": {
"id": "v6syZ8wKuRk2",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.model_selection import cross_val_score"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "-FGjsnxKuRk6",
"colab_type": "code",
"colab": {}
},
"source": [
"# Define the regression_model_cv function, which takes a fitted model as one parameter. The k = 5 hyperparameter gives the number of folds.\n",
"def regression_model_cv(model, k=5):\n",
" scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=k)\n",
" rmse = np.sqrt(-scores)\n",
" print('Reg rmse:', rmse)\n",
" print('Reg mean:', rmse.mean ())"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BUGvAr4_uRk-",
"colab_type": "text"
},
"source": [
"In sklearn, the scoring options are sometimes limited. Since mean_squared_error is not an option for cross_val_score, we choose the neg_mean_squared_error. cross_val_score takes the highest value by default, and the highest negative mean squared error is 0."
]
},
{
"cell_type": "code",
"metadata": {
"id": "xg4ol_TOuRk_",
"colab_type": "code",
"colab": {},
"outputId": "f3f9fceb-5429-4a9e-e92d-76212a6b2614"
},
"source": [
"regression_model_cv(LinearRegression())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.26123843 4.42712448 5.66151114 8.09493087 5.24453989]\n",
"Reg mean: 5.337868962878373\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "jKDzeadquRlC",
"colab_type": "code",
"colab": {},
"outputId": "fe3d8168-eec6-42c8-a5b7-484c5ba56620"
},
"source": [
"#Use the regression_model_cv function on the LinearRegression() model with 3 folds and then 6 folds, as shown in the following code snippet, for 3 folds:\n",
"regression_model_cv(LinearRegression(), k=3)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [ 3.72504914 6.01655701 23.20863933]\n",
"Reg mean: 10.983415161090685\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "OFZtNAA7uRlE",
"colab_type": "code",
"colab": {},
"outputId": "3b63d29e-115d-4617-c9ca-b5cecd08de3b"
},
"source": [
"# Now, test the values for 6 folds\n",
"regression_model_cv(LinearRegression(), k=6)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.23879491 3.97041949 5.58329663 3.92861033 9.88399671 3.91442679]\n",
"Reg mean: 5.08659081080109\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "66cprzOKuRlH",
"colab_type": "text"
},
"source": [
"# Regularization: Ridge and Lasso"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fACqT5XvuRlI",
"colab_type": "text"
},
"source": [
"**Regularization** is an important concept in machine learning; it's used to counteract overfitting. In the world of big data, it's easy to overfit data to the training set. When this happens, the model will often perform badly on the test set as indicated by mean_squared_error, or some other error.\n",
"\n",
"**Ridge** is a simple alternative to linear regression, designed to counteract overfitting. Ridge includes an L2 penalty term (L2 is based on Euclidean Distance) that shrinks the linear coefficients based on their size. The coefficients are the weights, numbers that determine how influential each column is on the output. Larger weights carry greater penalties in Ridge.\n",
"\n",
"**Lasso** is another regularized alternative to linear regression. Lasso adds a penalty equal to the absolute value of the magnitude of coefficients. This L1 regularization (L1 is taxicab distance) can eliminate some columns and result in a model that is sparse by comparison."
]
},
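{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before comparing cross-validation scores, here is a minimal sketch (assuming the housing `X` and `y` defined earlier) of what the penalties actually do: both shrink the coefficients relative to plain linear regression, and Lasso's L1 penalty can drive some of them all the way to zero:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from sklearn.linear_model import LinearRegression, Ridge, Lasso\n",
"\n",
"# Regularization shrinks the weights: compare total coefficient magnitude\n",
"for name, model in [('LinearRegression', LinearRegression()),\n",
"                    ('Ridge', Ridge()), ('Lasso', Lasso())]:\n",
"    model.fit(X, y)\n",
"    print('{}: sum of |coefficients| = {:.2f}'.format(name, np.abs(model.coef_).sum()))"
],
"execution_count": 0,
"outputs": []
},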
{
"cell_type": "markdown",
"metadata": {
"id": "7vg-EB6QuRlJ",
"colab_type": "text"
},
"source": [
"Let's look at an example to check how Ridge and Lasso perform on our Boston Housing dataset."
]
},
{
"cell_type": "code",
"metadata": {
"id": "o5gnEL8nuRlJ",
"colab_type": "code",
"colab": {},
"outputId": "715d715f-5a90-4876-8e94-8d6f02e80f16"
},
"source": [
"#We begin by setting Ridge() as a parameter for regression_model_cv\n",
"from sklearn.linear_model import Ridge\n",
"regression_model_cv(Ridge())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.17202127 4.54972372 5.36604368 8.03715216 5.03988501]\n",
"Reg mean: 5.2329651662517715\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5b_-_hRouRlO",
"colab_type": "text"
},
"source": [
"It's not surprising that Ridge has a slightly better score than linear regression. This is because both algorithms use Euclidean distance and the linear regression model is overfitting the data by a slight amount. "
]
},
{
"cell_type": "code",
"metadata": {
"id": "v3VdVCdzuRlP",
"colab_type": "code",
"colab": {},
"outputId": "b2fa550c-f0df-4f60-dc6d-4a37852ef696"
},
"source": [
"# Now, set Lasso() as the parameter for regression_model_cv:\n",
"from sklearn.linear_model import Lasso\n",
"regression_model_cv(Lasso())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.52318747 5.70083491 7.82318757 6.9878025 3.97229348]\n",
"Reg mean: 5.60146118538429\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cuqpngAHuRlT",
"colab_type": "text"
},
"source": [
"Whenever you're trying LinearRegression(), it's always worth trying Lasso and Ridge as well, since overfitting the data is common, and they only actually take a few lines of code to test. Lasso does not perform as well here because the L1 distance metric, taxicab distance, was not used in our model."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Qh37SX6XuRlU",
"colab_type": "text"
},
"source": [
"# K-Nearest Neighbors, Decision Trees, and Random Forests"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "90yfFvMzuRlU",
"colab_type": "text"
},
"source": [
"Are there other machine learning algorithms, besides LinearRegression(), that is suitable for the Boston Housing dataset? Absolutely. There are many regressors in the scikit-learn library that may be used. Regressors are generally considered a class of machine learning algorithms that are suitable for continuous target values. In addition to Linear Regression, Ridge, and Lasso, we can try K-Nearest Neighbors, Decision Trees, and Random Forests. These models perform well on a wide range of datasets. Let's try them out and analyze them individually."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yvw6Y70SuRlV",
"colab_type": "text"
},
"source": [
"# K-Nearest Neighbors"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6DJ-dbV4uRlV",
"colab_type": "text"
},
"source": [
"The idea behind k-nearest neighbors (KNN) is straightforward. When choosing the output of a row with an unknown label, the prediction is the same as the output of its k-nearest neighbors, where k may be any whole number.\n",
"\n",
"For instance, let's say that k=3. Given an unknown label, we take n columns for this row and place them in n-dimensional space. Then we look for the three closest points. These points already have labels. We assume the majority label for our new point.\n",
"\n",
"KNN is commonly used for classification since classification is based on grouping values, but it can be applied to regression as well. When determining the value of a home, for instance, in our Boston Housing dataset, it makes sense to compare the values of homes in a similar location, with a similar number of bedrooms, a similar amount of square footage, and so on.\n",
"\n",
"You can always choose the number of neighbors for the algorithm and adjust it accordingly. The number of neighbors denoted here is k, which is also called a hyperparameter. In machine learning, the model parameters are derived during training, whereas the hyperparameters are chosen in advance.\n",
"\n",
"Fine-tuning hyperparameters is an essential task to master when building machine learning models. Learning the ins and outs of hyperparameter tuning takes time, practice, and experimentation."
]
},
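{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea concrete, here is a minimal from-scratch sketch of a k=3 regression prediction on a tiny made-up dataset: compute the distance to every known point, find the three closest, and average their targets:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Tiny made-up dataset: six points with two features each, plus their targets\n",
"points = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]])\n",
"targets = np.array([10.0, 12.0, 11.0, 40.0, 42.0, 41.0])\n",
"new_point = np.array([1.5, 1.5])\n",
"\n",
"# Euclidean distance from the new point to every known point\n",
"distances = np.sqrt(((points - new_point) ** 2).sum(axis=1))\n",
"\n",
"# Indices of the k=3 nearest neighbors; the prediction is the mean of their targets\n",
"nearest = distances.argsort()[:3]\n",
"print('Predicted value:', targets[nearest].mean())"
],
"execution_count": 0,
"outputs": []
},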
{
"cell_type": "markdown",
"metadata": {
"id": "59k2Fev9uRlW",
"colab_type": "text"
},
"source": [
"### Using K-Nearest Neighbors "
]
},
{
"cell_type": "code",
"metadata": {
"id": "fgS11cU6uRlW",
"colab_type": "code",
"colab": {},
"outputId": "3a04f886-6f3a-4f10-b120-e6b935e5ca0f"
},
"source": [
"from sklearn.neighbors import KNeighborsRegressor\n",
"regression_model_cv(KNeighborsRegressor())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [ 8.24568226 8.81322798 10.58043836 8.85643441 5.98100069]\n",
"Reg mean: 8.495356738515685\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fVaW6jBSuRlZ",
"colab_type": "text"
},
"source": [
"We can change the number of neighbors to see if we can get better results. The default number of neighbors is 5. Let's change the number of neighbors to 4, 7, and 10."
]
},
{
"cell_type": "code",
"metadata": {
"id": "XkQYYkCquRlZ",
"colab_type": "code",
"colab": {},
"outputId": "4a022ea2-3bd5-4a21-f798-02ee93770bcc"
},
"source": [
"regression_model_cv(KNeighborsRegressor(n_neighbors=4))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [ 8.44659788 8.99814547 10.97170231 8.86647969 5.72114135]\n",
"Reg mean: 8.600813339223432\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "yueL2qBeuRlc",
"colab_type": "code",
"colab": {},
"outputId": "8c1e12b2-2268-4c98-f8b8-5b545ecdbc01"
},
"source": [
"regression_model_cv(KNeighborsRegressor(n_neighbors=7))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [ 7.99710601 8.68309183 10.66332898 8.90261573 5.51032355]\n",
"Reg mean: 8.351293217401393\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "3LXuO7MzuRlh",
"colab_type": "code",
"colab": {},
"outputId": "e6c59880-730a-4719-d2fb-68351aa04109"
},
"source": [
"regression_model_cv(KNeighborsRegressor(n_neighbors=10))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [ 7.47549287 8.62914556 10.69543822 8.91330686 6.52982222]\n",
"Reg mean: 8.448641147609868\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eyeq-raLuRll",
"colab_type": "text"
},
"source": [
"Scikit-learn provides a nice option to check a wide range of hyperparameters, which is GridSearchCV. The idea behind **GridSearchCV** is to use cross-validation to check all possible values in a grid. The value in the grid that gives the best result is then accepted as a hyperparameter."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lUjeEdmruRll",
"colab_type": "text"
},
"source": [
"# K-Nearest Neighbors with GridSearchCV to Find the Optimal Number of Neighbors"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ykJ-KjnSuRlm",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.model_selection import GridSearchCV"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "QOzpdWgTuRlq",
"colab_type": "code",
"colab": {}
},
"source": [
"# Now, choose the grid. The grid is the range of numbers – in this case, neighbors – that will be checked. Set up a hyperparameter grid for between 1 and 20 neighbors:\n",
"neighbors = np.linspace(1,20,20)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Lawyv8GVuRls",
"colab_type": "text"
},
"source": [
"We achieve this with np.linspace(1, 20, 20), where the 1 is the first number, the first 20 is the last number, and the second 20 in the brackets is the number of intervals to count."
]
},
{
"cell_type": "code",
"metadata": {
"id": "GWdCMWH8uRlt",
"colab_type": "code",
"colab": {}
},
"source": [
"# Convert floats to int (required by knn):\n",
"k = neighbors.astype(int)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "BewLKhuuuRlv",
"colab_type": "code",
"colab": {}
},
"source": [
"# Now, place the grid in a dictionary\n",
"param_grid = { 'n_neighbors': k }"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "juHIQ-vyuRly",
"colab_type": "code",
"colab": {}
},
"source": [
"# Build the model for each neighbor:\n",
"knn = KNeighborsRegressor()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "VHZ4DV3FuRl1",
"colab_type": "code",
"colab": {}
},
"source": [
"# Instantiate the GridSearchCV object – knn_tuned:\n",
"knn_tuned = GridSearchCV(knn, param_grid, cv=5, scoring='neg_mean_squared_error')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "xz7YYbdouRl5",
"colab_type": "code",
"colab": {},
"outputId": "d3127ea2-be50-4678-b5e2-b7b116c26997"
},
"source": [
"#Fit knn_tuned to the data using .fit:\n",
"knn_tuned.fit(X, y)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"GridSearchCV(cv=5, error_score=nan,\n",
" estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30,\n",
" metric='minkowski',\n",
" metric_params=None, n_jobs=None,\n",
" n_neighbors=5, p=2,\n",
" weights='uniform'),\n",
" iid='deprecated', n_jobs=None,\n",
" param_grid={'n_neighbors': array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,\n",
" 18, 19, 20])},\n",
" pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n",
" scoring='neg_mean_squared_error', verbose=0)"
]
},
"metadata": {
"tags": []
},
"execution_count": 27
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "ewZwAmKBuRl7",
"colab_type": "code",
"colab": {},
"outputId": "6f6cb020-92e0-4982-8db4-5a8fc17f8fc3"
},
"source": [
"# Finally, you print the best parameter results, \n",
"k = knn_tuned.best_params_\n",
"print('Best n_neighbors: {}'.format(k))\n",
"score = knn_tuned.best_score_\n",
"rsm = np.sqrt(-score)\n",
"print('Best score: {}'.format(rsm))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Best n_neighbors: {'n_neighbors': 7}\n",
"Best score: 8.516767055977628\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H3KMCZUXuRl-",
"colab_type": "text"
},
"source": [
"# Decision Trees and Random Forests"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-c6RTHG7uRl_",
"colab_type": "text"
},
"source": [
"Decision Trees are very good machine learning algorithms, but they are prone to overfitting. A random forest is an ensemble of decision trees. Random forests consistently outperform decision trees because their predictions generalize to data much better. A random forest may consist of hundreds of decision trees.\n",
"\n",
"A random forest is a great machine-learning algorithm to try on almost any dataset. Random forests work well with both regression and classification, and they often perform well out of the box."
]
},
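{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the ensemble idea (assuming the housing `X` and `y` defined earlier), we can bag a few decision trees by hand: train each tree on a bootstrap sample of the rows and average their predictions:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"rng = np.random.RandomState(0)\n",
"all_preds = []\n",
"for i in range(10):\n",
"    # Bootstrap sample: draw rows with replacement\n",
"    sample = rng.choice(len(X), size=len(X), replace=True)\n",
"    tree_reg = DecisionTreeRegressor(random_state=i)\n",
"    tree_reg.fit(X.iloc[sample], y.iloc[sample])\n",
"    all_preds.append(tree_reg.predict(X))\n",
"\n",
"# The ensemble's prediction is the average over all trees\n",
"forest_pred = np.mean(all_preds, axis=0)\n",
"print('First five averaged predictions:', forest_pred[:5])"
],
"execution_count": 0,
"outputs": []
},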
{
"cell_type": "markdown",
"metadata": {
"id": "tBt1zLe1uRl_",
"colab_type": "text"
},
"source": [
"### Decision Tree"
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "8EfpRcSouRmA",
"colab_type": "code",
"colab": {},
"outputId": "a1786eae-6514-4cf4-9ed1-e84223d3a96a"
},
"source": [
"from sklearn import tree\n",
"regression_model_cv(tree.DecisionTreeRegressor())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.8171001 5.74304202 8.26752837 6.79114278 5.57497844]\n",
"Reg mean: 6.038758341497116\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "v0YesAAeuRmE",
"colab_type": "text"
},
"source": [
"### Random Forest"
]
},
{
"cell_type": "code",
"metadata": {
"id": "2YhoM3YWuRmE",
"colab_type": "code",
"colab": {},
"outputId": "f3355959-d6d0-4f19-a648-8226d4203dc4"
},
"source": [
"# Use RandomForestRegressor() as the input for regression_model_cv\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"regression_model_cv(RandomForestRegressor())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.21681204 3.58531755 4.46366413 6.48583246 4.02492014]\n",
"Reg mean: 4.355309264608133\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qs702qS2uRmH",
"colab_type": "text"
},
"source": [
"# Random Forest Hyperparameters\n",
"\n",
"Random forests have a lot of hyperparameters. Instead of going over them all, we will highlight the most important ones:\n",
"\n",
"**n_jobs(default=None)**: The number of jobs has to do with internal processing. None means 1. It's ideal to set n_jobs = -1 to permit the use of all processors. Although this does not improve the accuracy of the model, it does improve the speed.\n",
"\n",
"**n_estimators(default=10)**: The number of trees in the forest. The more trees, the better. The more trees, the more RAM is required. It's worth increasing this number until the algorithm moves too slowly. Although 1,000,000 trees may give better results than 1,000, the gain might be small enough to be negligible. A good starting point is 100, and 500 if time permits.\n",
"\n",
"**max_depth(default=None)**: The max depth of the trees in the forest. The deeper the trees, the more information is captured about the data, but the more prone the trees are to overfitting. When set to the default max_depth of None, there are no limitations, and each tree goes as deep as necessary. The max depth may be reduced to a smaller number of branches.\n",
"\n",
"**min_samples_split(default=2)**: This is the minimum number of samples required for a new branch or split to occur. This number can be increased to constrain the trees as they require more samples to make a decision.\n",
"\n",
"**min_samples_leaf(default=1)**: This is the same as min_samples_split, except it's the minimum number of samples at the leaves or the base of the tree. By increasing this number, the branch will stop splitting when it reaches this parameter.\n",
"\n",
"**max_features(default=\"auto\")**: The number of features to consider when looking for the best split. The default for regression is to consider the total number of columns. For classification random forests, sqrt is recommended."
]
},
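{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of where each hyperparameter goes (the values below are arbitrary, not tuned), they are all passed to the constructor:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Arbitrary example values; the next exercises tune these properly\n",
"rf = RandomForestRegressor(n_jobs=-1, n_estimators=100, max_depth=20,\n",
"                           min_samples_split=4, min_samples_leaf=2,\n",
"                           max_features='auto')\n",
"regression_model_cv(rf)"
],
"execution_count": 0,
"outputs": []
},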
{
"cell_type": "markdown",
"metadata": {
"id": "QG1HsljhuRmH",
"colab_type": "text"
},
"source": [
"### Random Forest Tuned to Improve the Prediction on Our Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "gAIgUcQHuRmH",
"colab_type": "code",
"colab": {},
"outputId": "02c3d78b-fad1-4f25-9f5e-097877468540"
},
"source": [
"# Set n_jobs = -1 and n_estimators=100 for RandomForestRegressor as the input of regression_model_cv. We can always use n_jobs to speed up the algorithm, and we can increase n_estimators to achieve better results:\n",
"regression_model_cv(RandomForestRegressor(n_jobs=-1, n_estimators=100))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.26567684 3.70417176 5.00312032 6.47581253 4.15593728]\n",
"Reg mean: 4.520943745411145\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "faIMeX5buRmK",
"colab_type": "text"
},
"source": [
"Sklearn provides RandomizedSearchCV to check a wide range of hyperparameters. Instead of exhaustively going through a list, RandomizedSearchCV will check a set amount of random combinations and return the best results."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ixbrt9wjuRmL",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.model_selection import RandomizedSearchCV"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "T91Cg5esuRmO",
"colab_type": "code",
"colab": {}
},
"source": [
"# Set up the hyperparameter grid using max_depth\n",
"param_grid = { 'max_depth': [None, 10, 30 ,50,70,100,200,400],\n",
" 'min_samples_split': [2,3,4,5],\n",
" 'min_samples_leaf':[1,2,3],\n",
" 'max_features': ['auto','sqrt']}"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "hk7BDoo1uRmR",
"colab_type": "code",
"colab": {}
},
"source": [
"reg = RandomForestRegressor(n_jobs = -1)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "6GIqIdIKuRmV",
"colab_type": "code",
"colab": {}
},
"source": [
"reg_tuned = RandomizedSearchCV(reg, param_grid, cv=5, scoring='neg_mean_squared_error')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "OHRbKB4GuRmY",
"colab_type": "code",
"colab": {},
"outputId": "52a709e4-ff9d-4341-dde6-1b5aa4838da2"
},
"source": [
"reg_tuned.fit(X,y)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"RandomizedSearchCV(cv=5, error_score=nan,\n",
" estimator=RandomForestRegressor(bootstrap=True,\n",
" ccp_alpha=0.0,\n",
" criterion='mse',\n",
" max_depth=None,\n",
" max_features='auto',\n",
" max_leaf_nodes=None,\n",
" max_samples=None,\n",
" min_impurity_decrease=0.0,\n",
" min_impurity_split=None,\n",
" min_samples_leaf=1,\n",
" min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0,\n",
" n_estimators=100, n_jobs=-1,\n",
" oob_score=False,\n",
" random_state=None, verbose=0,\n",
" warm_start=False),\n",
" iid='deprecated', n_iter=10, n_jobs=None,\n",
" param_distributions={'max_depth': [None, 10, 30, 50, 70, 100,\n",
" 200, 400],\n",
" 'max_features': ['auto', 'sqrt'],\n",
" 'min_samples_leaf': [1, 2, 3],\n",
" 'min_samples_split': [2, 3, 4, 5]},\n",
" pre_dispatch='2*n_jobs', random_state=None, refit=True,\n",
" return_train_score=False, scoring='neg_mean_squared_error',\n",
" verbose=0)"
]
},
"metadata": {
"tags": []
},
"execution_count": 36
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "g2r3XbWiuRmb",
"colab_type": "code",
"colab": {},
"outputId": "3fe5770e-2c22-4bce-fe16-885dd39502ac"
},
"source": [
"p = reg_tuned.best_params_\n",
"print('Best n_neighbors:{}'.format(p))\n",
"score = reg_tuned.best_score_\n",
"rsm = np.sqrt(-score)\n",
"print('Best score: {}'.format(rsm))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Best n_neighbors:{'min_samples_split': 3, 'min_samples_leaf': 3, 'max_features': 'auto', 'max_depth': 50}\n",
"Best score: 4.595081290129918\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "9cULuVUNuRmd",
"colab_type": "code",
"colab": {},
"outputId": "819d0a92-777d-4e0c-c76f-a22aef79071b"
},
"source": [
"# Now, run a random forest regressor with n_jobs = -1 and n_estimators = 500:\n",
"# Setup the hyperparameter grid\n",
"regression_model_cv(RandomForestRegressor(n_jobs=-1, n_estimators=500))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.19922811 3.7531576 4.81348515 6.50186152 3.92958863]\n",
"Reg mean: 4.439464203351062\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CUY6O8BBuRmh",
"colab_type": "text"
},
"source": [
"Hyperparameters are a primary key to building excellent machine learning models. Anyone with basic machine learning training can build machine learning models using default hyperparameters. Using GridSearchCV and RandomizedSearchCV to fine-tune hyperparameters to create more efficient models distinguishes advanced users from beginners."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9JjAHdp8uRmi",
"colab_type": "text"
},
"source": [
"# Classification Models\n",
"\n",
"The Boston Housing dataset was great for regression because the target column took on continuous values without limit. There are many cases when the target column takes on one or two values, such as TRUE or FALSE, or possibly a grouping of three or more values, such as RED, BLUE, or GREEN. When the target column may be split into distinct categories, the group of machine learning models that you should try are referred to as classification."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gfCLxLeXuRmi",
"colab_type": "text"
},
"source": [
"To make things interesting, let's load a new dataset used to detect pulsar stars in outer space."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ye0OYHzruRmj",
"colab_type": "code",
"colab": {},
"outputId": "a4ae741c-ca1c-4d67-ae70-bc835a1a9a76"
},
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"df = pd.read_csv('HTRU_2.csv')\n",
"df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>140.5625</th>\n",
" <th>55.68378214</th>\n",
" <th>-0.234571412</th>\n",
" <th>-0.699648398</th>\n",
" <th>3.199832776</th>\n",
" <th>19.11042633</th>\n",
" <th>7.975531794</th>\n",
" <th>74.24222492</th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>102.507812</td>\n",
" <td>58.882430</td>\n",
" <td>0.465318</td>\n",
" <td>-0.515088</td>\n",
" <td>1.677258</td>\n",
" <td>14.860146</td>\n",
" <td>10.576487</td>\n",
" <td>127.393580</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>103.015625</td>\n",
" <td>39.341649</td>\n",
" <td>0.323328</td>\n",
" <td>1.051164</td>\n",
" <td>3.121237</td>\n",
" <td>21.744669</td>\n",
" <td>7.735822</td>\n",
" <td>63.171909</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>136.750000</td>\n",
" <td>57.178449</td>\n",
" <td>-0.068415</td>\n",
" <td>-0.636238</td>\n",
" <td>3.642977</td>\n",
" <td>20.959280</td>\n",
" <td>6.896499</td>\n",
" <td>53.593661</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>88.726562</td>\n",
" <td>40.672225</td>\n",
" <td>0.600866</td>\n",
" <td>1.123492</td>\n",
" <td>1.178930</td>\n",
" <td>11.468720</td>\n",
" <td>14.269573</td>\n",
" <td>252.567306</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>93.570312</td>\n",
" <td>46.698114</td>\n",
" <td>0.531905</td>\n",
" <td>0.416721</td>\n",
" <td>1.636288</td>\n",
" <td>14.545074</td>\n",
" <td>10.621748</td>\n",
" <td>131.394004</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 140.5625 55.68378214 -0.234571412 -0.699648398 3.199832776 \\\n",
"0 102.507812 58.882430 0.465318 -0.515088 1.677258 \n",
"1 103.015625 39.341649 0.323328 1.051164 3.121237 \n",
"2 136.750000 57.178449 -0.068415 -0.636238 3.642977 \n",
"3 88.726562 40.672225 0.600866 1.123492 1.178930 \n",
"4 93.570312 46.698114 0.531905 0.416721 1.636288 \n",
"\n",
" 19.11042633 7.975531794 74.24222492 0 \n",
"0 14.860146 10.576487 127.393580 0 \n",
"1 21.744669 7.735822 63.171909 0 \n",
"2 20.959280 6.896499 53.593661 0 \n",
"3 11.468720 14.269573 252.567306 0 \n",
"4 14.545074 10.621748 131.394004 0 "
]
},
"metadata": {
"tags": []
},
"execution_count": 39
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vbjM2Y6puRml",
"colab_type": "text"
},
"source": [
"Looks interesting, and problematic. Notice that the column headers appear to be another row. It's impossible to analyze data without knowing what the columns are supposed to be, right?\n",
"\n",
"Note that the last column is all 0's in the DataFrame. This suggests that this is the Class column, which is our target column. When detecting the presence of something – in this case, pulsar stars – it's common to use a 1 for positive identification, and a 0 for a negative identification.\n",
"\n",
"Since Class is last in the list, let's assume that the columns are given in the correct order presented in the Attribute Information list. We can also assume that losing the current column headers, a negative identification among 17,898 rows is meaningless. The easiest way forward is simply to change the column headers to match the attribute list."
]
},
{
"cell_type": "code",
"metadata": {
"id": "e8WtkTbKuRmm",
"colab_type": "code",
"colab": {},
"outputId": "feb83ac2-cf34-4d51-8997-c61dacb1999b"
},
"source": [
"# Now, change column headers to match the official list and print the first five rows, as shown in the following code snippet:\n",
"df.columns = [['Mean of integrated profile', 'Standard deviation of integrated profile', \n",
" 'Excess kurtosis of integrated profile', 'Skewness of integrated profile',\n",
" 'Mean of DM-SNR curve', 'Standard deviation of DM-SNR curve',\n",
" 'Excess kurtosis of DM-SNR curve', 'Skewness of DM-SNR curve', 'Class' ]]\n",
"\n",
"df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th>Mean of integrated profile</th>\n",
" <th>Standard deviation of integrated profile</th>\n",
" <th>Excess kurtosis of integrated profile</th>\n",
" <th>Skewness of integrated profile</th>\n",
" <th>Mean of DM-SNR curve</th>\n",
" <th>Standard deviation of DM-SNR curve</th>\n",
" <th>Excess kurtosis of DM-SNR curve</th>\n",
" <th>Skewness of DM-SNR curve</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>102.507812</td>\n",
" <td>58.882430</td>\n",
" <td>0.465318</td>\n",
" <td>-0.515088</td>\n",
" <td>1.677258</td>\n",
" <td>14.860146</td>\n",
" <td>10.576487</td>\n",
" <td>127.393580</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>103.015625</td>\n",
" <td>39.341649</td>\n",
" <td>0.323328</td>\n",
" <td>1.051164</td>\n",
" <td>3.121237</td>\n",
" <td>21.744669</td>\n",
" <td>7.735822</td>\n",
" <td>63.171909</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>136.750000</td>\n",
" <td>57.178449</td>\n",
" <td>-0.068415</td>\n",
" <td>-0.636238</td>\n",
" <td>3.642977</td>\n",
" <td>20.959280</td>\n",
" <td>6.896499</td>\n",
" <td>53.593661</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>88.726562</td>\n",
" <td>40.672225</td>\n",
" <td>0.600866</td>\n",
" <td>1.123492</td>\n",
" <td>1.178930</td>\n",
" <td>11.468720</td>\n",
" <td>14.269573</td>\n",
" <td>252.567306</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>93.570312</td>\n",
" <td>46.698114</td>\n",
" <td>0.531905</td>\n",
" <td>0.416721</td>\n",
" <td>1.636288</td>\n",
" <td>14.545074</td>\n",
" <td>10.621748</td>\n",
" <td>131.394004</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Mean of integrated profile Standard deviation of integrated profile \\\n",
"0 102.507812 58.882430 \n",
"1 103.015625 39.341649 \n",
"2 136.750000 57.178449 \n",
"3 88.726562 40.672225 \n",
"4 93.570312 46.698114 \n",
"\n",
" Excess kurtosis of integrated profile Skewness of integrated profile \\\n",
"0 0.465318 -0.515088 \n",
"1 0.323328 1.051164 \n",
"2 -0.068415 -0.636238 \n",
"3 0.600866 1.123492 \n",
"4 0.531905 0.416721 \n",
"\n",
" Mean of DM-SNR curve Standard deviation of DM-SNR curve \\\n",
"0 1.677258 14.860146 \n",
"1 3.121237 21.744669 \n",
"2 3.642977 20.959280 \n",
"3 1.178930 11.468720 \n",
"4 1.636288 14.545074 \n",
"\n",
" Excess kurtosis of DM-SNR curve Skewness of DM-SNR curve Class \n",
"0 10.576487 127.393580 0 \n",
"1 7.735822 63.171909 0 \n",
"2 6.896499 53.593661 0 \n",
"3 14.269573 252.567306 0 \n",
"4 10.621748 131.394004 0 "
]
},
"metadata": {
"tags": []
},
"execution_count": 40
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "GnQ_ueUfuRmo",
"colab_type": "code",
"colab": {},
"outputId": "3d29d2c8-e3dd-4584-b068-cd292de731aa"
},
"source": [
"df.info()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 17897 entries, 0 to 17896\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 (Mean of integrated profile,) 17897 non-null float64\n",
" 1 (Standard deviation of integrated profile,) 17897 non-null float64\n",
" 2 (Excess kurtosis of integrated profile,) 17897 non-null float64\n",
" 3 (Skewness of integrated profile,) 17897 non-null float64\n",
" 4 (Mean of DM-SNR curve,) 17897 non-null float64\n",
" 5 (Standard deviation of DM-SNR curve,) 17897 non-null float64\n",
" 6 (Excess kurtosis of DM-SNR curve,) 17897 non-null float64\n",
" 7 (Skewness of DM-SNR curve,) 17897 non-null float64\n",
" 8 (Class,) 17897 non-null int64 \n",
"dtypes: float64(8), int64(1)\n",
"memory usage: 1.2 MB\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "gwGElbTUuRmr",
"colab_type": "code",
"colab": {},
"outputId": "55cd6702-ed15-4e83-8b39-6197d82258dc"
},
"source": [
"df.shape"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(17897, 9)"
]
},
"metadata": {
"tags": []
},
"execution_count": 42
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "TcAfcOVmuRmu",
"colab_type": "code",
"colab": {},
"outputId": "fafcd332-a3a7-4022-e2c1-10ab6a60557b"
},
"source": [
"df.isnull().any()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Mean of integrated profile False\n",
"Standard deviation of integrated profile False\n",
"Excess kurtosis of integrated profile False\n",
"Skewness of integrated profile False\n",
"Mean of DM-SNR curve False\n",
"Standard deviation of DM-SNR curve False\n",
"Excess kurtosis of DM-SNR curve False\n",
"Skewness of DM-SNR curve False\n",
"Class False\n",
"dtype: bool"
]
},
"metadata": {
"tags": []
},
"execution_count": 43
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_g7h5DKxuRmw",
"colab_type": "text"
},
"source": [
"We know that there are no null values. If there were null values, we would need to eliminate the rows or fill them in by taking the mean, the median, the mode, or another value from the columns."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "F4y2uqaXuRmw",
"colab_type": "text"
},
"source": [
"When it comes to preparing data for machine learning, it's essential to have clean, numerical data with no null values. Further data analysis is often warranted, depending upon the goal at hand. If the goal is simply to try out some models and check them for accuracy, it's fine to go ahead. If the goal is to uncover deep insights about the data, further statistical analysis, as introduced in the previous chapter, is always warranted."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uVPn6xhvuRmw",
"colab_type": "text"
},
"source": [
"# Logistic Regression\n",
"\n",
"When it comes to datasets that classify points, logistic regression is one of the most popular and successful machine learning algorithms. Logistic regression utilizes the sigmoid function to determine whether points should approach one value or the other. As the following diagram indicates, it's a good idea to classify the target values as 0 and 1 when utilizing logistic regression. In the pulsar dataset, the values are already classified as 0s and 1s. If the dataset was labeled as Red and Blue, converting them in advance to 0 and 1 would be essential"
]
},
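{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the idea, the sigmoid function maps any real number to a value between 0 and 1, and a 0.5 threshold turns that value into a class:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sigmoid squashes any real number into the interval (0, 1)\n",
"def sigmoid(z):\n",
"    return 1 / (1 + np.exp(-z))\n",
"\n",
"for z in [-4, -1, 0, 1, 4]:\n",
"    p = sigmoid(z)\n",
"    # Outputs above 0.5 are classified as 1, the rest as 0\n",
"    print('z = {:>2}, sigmoid = {:.3f}, class = {}'.format(z, p, int(p > 0.5)))"
],
"execution_count": 0,
"outputs": []
},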
{
"cell_type": "markdown",
"metadata": {
"id": "XBgypeSNuRmx",
"colab_type": "text"
},
"source": [
"### Using Logistic Regression to Predict Data Accuracy"
]
},
{
"cell_type": "code",
"metadata": {
"id": "QOvhwA6ZuRmx",
"colab_type": "code",
"colab": {}
},
"source": [
"# Import LogisticRegression:\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.linear_model import LogisticRegression\n",
"from warnings import filterwarnings\n",
"filterwarnings('ignore')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "1s6LIitcuRm1",
"colab_type": "code",
"colab": {}
},
"source": [
"# Set up matrices X and y to store the predictors and response variables, respectively:\n",
"X = df.iloc[:, 0:8]\n",
"y = df.iloc[:, 8]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "6BTzToKluRm4",
"colab_type": "code",
"colab": {}
},
"source": [
"# Build Model\n",
"def clf_model(model):\n",
" clf =model\n",
" scores = cross_val_score(clf, X, y)\n",
" print('Scores:', scores)\n",
" print('Mean Score: ',scores.mean())"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "aa5q4qRPuRm7",
"colab_type": "code",
"colab": {},
"outputId": "1e0143c8-0c78-43fa-cd61-70c77e3f0666"
},
"source": [
"clf_model(LogisticRegression())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.97486034 0.97877095 0.98127969 0.9779268 0.9782062 ]\n",
"Mean Score: 0.9782087940047546\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oYq4ZPh8uRm9",
"colab_type": "text"
},
"source": [
"Logistic regression is very different than linear regression. Logistic regression uses the sigmoid function to classify all instances into one group or the other. Generally speaking, all cases that are above 0.5 are classified as a 1, and all cases that fall below 0.5 are classified as a 0, with decimals that are close to 1 more likely to be a 1, and decimals that are close to 0 more likely to be a 0. Linear regression, by contrast, finds a straight line that minimizes the error between the straight line and the individual points. Logistic regression classifies all points into two groups; all new points will fall into one of these groups. Linear regression finds a line of best fit; all new points may fall anywhere on the line and take on any value."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rzQ_9oKiuRm9",
"colab_type": "text"
},
"source": [
"# Naive Bayes\n",
"\n",
"Naive Bayes is a model based on Bayes' theorem, a famous probability theorem based on a conditional probability that assumes independent events. Similarly, Naive Bayes assumes independent attributes or columns."
]
},
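{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, Bayes' theorem gives the probability of a class $y$ given the observed features $x_1, \\dots, x_n$:\n",
"\n",
"$$P(y \\mid x_1, \\dots, x_n) = \\frac{P(y)\\,P(x_1, \\dots, x_n \\mid y)}{P(x_1, \\dots, x_n)}$$\n",
"\n",
"The naive independence assumption lets the likelihood factor into a product over the columns: $P(x_1, \\dots, x_n \\mid y) = \\prod_{i=1}^{n} P(x_i \\mid y)$."
]
},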
{
"cell_type": "markdown",
"metadata": {
"id": "fnmahcPwuRm_",
"colab_type": "text"
},
"source": [
"### Using GaussianNB, KneighborsClassifier, DecisionTreeClassifier, and RandomForestClassifier to Predict Accuracy in Our Dataset\n",
"\n",
"The goal of this exercise is to predict pulsars using a variety of classifiers, including GaussianNB, KneighborsClassifier, DecisionTreeClassifier, and RandomForestClassifier."
]
},
{
"cell_type": "code",
"metadata": {
"id": "8N9X7_dhuRm_",
"colab_type": "code",
"colab": {},
"outputId": "f6cd7159-3861-4296-8df1-8b51fbb83d11"
},
"source": [
"# GuassianNB\n",
"from sklearn.naive_bayes import GaussianNB\n",
"clf_model(GaussianNB())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.96061453 0.92374302 0.94272143 0.92847164 0.96451523]\n",
"Mean Score: 0.9440131680613636\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "rDLaaou3uRnC",
"colab_type": "code",
"colab": {},
"outputId": "4b6b0682-740e-41be-b92a-079f5170e472"
},
"source": [
"# kNeighborsClassifier\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"clf_model(KNeighborsClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.96955307 0.96927374 0.97317687 0.9706622 0.97289746]\n",
"Mean Score: 0.9711126668446134\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Sg8UhDj-uRnI",
"colab_type": "code",
"colab": {},
"outputId": "e9e8dfe1-c7f5-44fa-a0fc-d5ef3c45c1da"
},
"source": [
"# Decision Tree Clf\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"clf_model(DecisionTreeClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.96703911 0.96424581 0.96842693 0.963677 0.96842693]\n",
"Mean Score: 0.966363158149416\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "jrB1Kn7juRnL",
"colab_type": "code",
"colab": {},
"outputId": "348a5e05-72a2-4f75-a229-f8e4d5cb729a"
},
"source": [
"# Random Forest Clf\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"clf_model(RandomForestClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.97765363 0.98184358 0.97988265 0.97541213 0.97736798]\n",
"Mean Score: 0.9784319923326793\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XiCewXpJuRnN",
"colab_type": "text"
},
"source": [
"#### All classifiers have achieved between 94% and 98% accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VMeGC6mQuRnN",
"colab_type": "text"
},
"source": [
"You may also wonder how to know when to use these classifiers. The bottom line is that whenever you have a classification problem, meaning that the data has a target column with a finite number of options, such as three kinds of wine, all classifiers are worth trying. Naive Bayes is known to work well with text data, and random forests are known to work well generally."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ACUtocv3uRnO",
"colab_type": "text"
},
"source": [
"# Confusion Matrix and Classification Report for the Pulsar Dataset"
]
},
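{
"cell_type": "markdown",
"metadata": {},
"source": [
"A confusion matrix breaks predictions down by class. For a binary target, sklearn's confusion_matrix lays it out with actual classes as rows and predicted classes as columns, so the diagonal holds the correct predictions:\n",
"\n",
"| | Predicted 0 | Predicted 1 |\n",
"| --- | --- | --- |\n",
"| **Actual 0** | true negatives | false positives |\n",
"| **Actual 1** | false negatives | true positives |\n",
"\n",
"In the classification report, precision is TP / (TP + FP), recall is TP / (TP + FN), and the f1-score is their harmonic mean."
]
},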
{
"cell_type": "code",
"metadata": {
"id": "DQ5pQQcbuRnP",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.metrics import classification_report\n",
"from sklearn.metrics import confusion_matrix"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "OZSujmWpuRnS",
"colab_type": "code",
"colab": {}
},
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "mngvEembuRnW",
"colab_type": "code",
"colab": {}
},
"source": [
"def confusion(model):\n",
" clf = model\n",
" clf.fit(X_train, y_train)\n",
" y_pred = clf.predict(X_test)\n",
" print('Confusion Matrix: ', confusion_matrix(y_test, y_pred))\n",
" print('Classification Report:', classification_report(y_test, y_pred))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "_NIaHsEquRnZ",
"colab_type": "code",
"colab": {},
"outputId": "5e976300-59df-4c6c-e00c-6f576f2c7f70"
},
"source": [
"confusion(LogisticRegression())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[4030 24]\n",
" [ 77 344]]\n",
"Classification Report: precision recall f1-score support\n",
"\n",
" 0 0.98 0.99 0.99 4054\n",
" 1 0.93 0.82 0.87 421\n",
"\n",
" accuracy 0.98 4475\n",
" macro avg 0.96 0.91 0.93 4475\n",
"weighted avg 0.98 0.98 0.98 4475\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "4WJRVATXuRnb",
"colab_type": "code",
"colab": {},
"outputId": "3aeeb820-fa9d-4f00-9995-037e04098086"
},
"source": [
"confusion(KNeighborsClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[4018 36]\n",
" [ 88 333]]\n",
"Classification Report: precision recall f1-score support\n",
"\n",
" 0 0.98 0.99 0.98 4054\n",
" 1 0.90 0.79 0.84 421\n",
"\n",
" accuracy 0.97 4475\n",
" macro avg 0.94 0.89 0.91 4475\n",
"weighted avg 0.97 0.97 0.97 4475\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "UTp4W7eAuRnd",
"colab_type": "code",
"colab": {},
"outputId": "f3429b9c-98d6-416c-e8c7-6f8fae0a9557"
},
"source": [
"confusion(GaussianNB())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[3883 171]\n",
" [ 64 357]]\n",
"Classification Report: precision recall f1-score support\n",
"\n",
" 0 0.98 0.96 0.97 4054\n",
" 1 0.68 0.85 0.75 421\n",
"\n",
" accuracy 0.95 4475\n",
" macro avg 0.83 0.90 0.86 4475\n",
"weighted avg 0.95 0.95 0.95 4475\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "1Sk0XvaPuRnf",
"colab_type": "code",
"colab": {},
"outputId": "9f8fc21e-4896-40c5-b550-add8d3e5c1b6"
},
"source": [
"confusion(RandomForestClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[4028 26]\n",
" [ 71 350]]\n",
"Classification Report: precision recall f1-score support\n",
"\n",
" 0 0.98 0.99 0.99 4054\n",
" 1 0.93 0.83 0.88 421\n",
"\n",
" accuracy 0.98 4475\n",
" macro avg 0.96 0.91 0.93 4475\n",
"weighted avg 0.98 0.98 0.98 4475\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o2bbrIY8uRng",
"colab_type": "text"
},
"source": [
"# Boosting Methods\n",
"\n",
"Random Forests are a type of bagging method. A bagging method is a machine learning method that aggregates a large sum of machine learning models. In the case of Random Forests, the aggregates are decision trees.\n",
"\n",
"Another machine learning method is boosting. The idea behind boosting is to transform a weak learner into a strong learner by modifying the weights for the rows that the learner got wrong. A weak learner may have an error of 49%, hardly better than a coin flip. A strong learner, by contrast, may have an error rate of 1 or 2 %. With enough iterations, very weak learners can be transformed into very strong learners."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cxKoul2euRnh",
"colab_type": "text"
},
"source": [
"### Using AdaBoost to Predict the Best Optimal Values"
]
},
{
"cell_type": "code",
"metadata": {
"id": "SMdAX7bZuRnh",
"colab_type": "code",
"colab": {},
"outputId": "d8a979da-c287-478f-9133-83f6231df8b8"
},
"source": [
"from sklearn.ensemble import AdaBoostClassifier\n",
"clf_model(AdaBoostClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.97430168 0.97988827 0.98127969 0.97597094 0.97708857]\n",
"Mean Score: 0.9777058290056365\n"
],
"name": "stdout"
}
]
},
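{
"cell_type": "markdown",
"metadata": {},
"source": [
"AdaBoost's main hyperparameters are n_estimators (the number of boosting iterations) and learning_rate (how strongly each iteration's reweighting counts). As a minimal sketch with arbitrary, untuned values:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Arbitrary example values; smaller learning rates usually need more estimators\n",
"for lr in [0.1, 0.5, 1.0]:\n",
"    scores = cross_val_score(AdaBoostClassifier(n_estimators=100, learning_rate=lr), X, y)\n",
"    print('learning_rate = {}: mean score = {:.4f}'.format(lr, scores.mean()))"
],
"execution_count": 0,
"outputs": []
},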
{
"cell_type": "code",
"metadata": {
"id": "UIMr52jOuRnj",
"colab_type": "code",
"colab": {},
"outputId": "000decad-6367-4db5-9ab2-4af9c0da5f69"
},
"source": [
"confusion(AdaBoostClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[4025 29]\n",
" [ 83 338]]\n",
"Classification Report: precision recall f1-score support\n",
"\n",
" 0 0.98 0.99 0.99 4054\n",
" 1 0.92 0.80 0.86 421\n",
"\n",
" accuracy 0.97 4475\n",
" macro avg 0.95 0.90 0.92 4475\n",
"weighted avg 0.97 0.97 0.97 4475\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QRoYIq9YuRnl",
"colab_type": "text"
},
"source": [
"Totals of 98% for precision, recall, and the f1-score are outstanding. The f1-score of the positive pulsar classification, the 1's, is 86%, nearly performing as well as RandomForestClassifier."
]
},
{
"cell_type": "code",
"metadata": {
"id": "tdFZJXxquRnl",
"colab_type": "code",
"colab": {}
},
"source": [
"X = housing_df.iloc[:,:-1]\n",
"y = housing_df.iloc[:, -1]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "cUf_zeekuRnn",
"colab_type": "code",
"colab": {},
"outputId": "cf2c0735-1cae-4ff2-cd75-3b3d3daad705"
},
"source": [
"from sklearn.ensemble import AdaBoostRegressor\n",
"regression_model_cv(AdaBoostRegressor())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.46011122 3.23658592 5.94132343 6.33384407 4.69895292]\n",
"Reg mean: 4.734163512741491\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dzc-__LuuRnr",
"colab_type": "text"
},
"source": [
"# Using AdaBoost to Predict the Best Optimal Values"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J7ZCVzE1uRnr",
"colab_type": "text"
},
"source": [
"In this activity, you will use machine learning to solve a real-world problem. A bank wants to predict whether customers will return, also known as churn. They want to know which customers are most likely to leave. They give you their data, and they ask you to create a machine-learning algorithm to help them target the customers most likely to leave.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "bycFtb76uRns",
"colab_type": "code",
"colab": {},
"outputId": "8be2efb1-fd2f-4c73-dcbb-b50a0cb8a92f"
},
"source": [
"df = pd.read_csv('CHURN.csv')\n",
"df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>customerID</th>\n",
" <th>gender</th>\n",
" <th>SeniorCitizen</th>\n",
" <th>Partner</th>\n",
" <th>Dependents</th>\n",
" <th>tenure</th>\n",
" <th>PhoneService</th>\n",
" <th>MultipleLines</th>\n",
" <th>InternetService</th>\n",
" <th>OnlineSecurity</th>\n",
" <th>...</th>\n",
" <th>DeviceProtection</th>\n",
" <th>TechSupport</th>\n",
" <th>StreamingTV</th>\n",
" <th>StreamingMovies</th>\n",
" <th>Contract</th>\n",
" <th>PaperlessBilling</th>\n",
" <th>PaymentMethod</th>\n",
" <th>MonthlyCharges</th>\n",
" <th>TotalCharges</th>\n",
" <th>Churn</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7590-VHVEG</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>1</td>\n",
" <td>No</td>\n",
" <td>No phone service</td>\n",
" <td>DSL</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Electronic check</td>\n",
" <td>29.85</td>\n",
" <td>29.85</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5575-GNVDE</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>34</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>One year</td>\n",
" <td>No</td>\n",
" <td>Mailed check</td>\n",
" <td>56.95</td>\n",
" <td>1889.5</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3668-QPYBK</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Mailed check</td>\n",
" <td>53.85</td>\n",
" <td>108.15</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>7795-CFOCW</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>45</td>\n",
" <td>No</td>\n",
" <td>No phone service</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>One year</td>\n",
" <td>No</td>\n",
" <td>Bank transfer (automatic)</td>\n",
" <td>42.30</td>\n",
" <td>1840.75</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9237-HQITU</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Fiber optic</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Electronic check</td>\n",
" <td>70.70</td>\n",
" <td>151.65</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" customerID gender SeniorCitizen Partner Dependents tenure PhoneService \\\n",
"0 7590-VHVEG Female 0 Yes No 1 No \n",
"1 5575-GNVDE Male 0 No No 34 Yes \n",
"2 3668-QPYBK Male 0 No No 2 Yes \n",
"3 7795-CFOCW Male 0 No No 45 No \n",
"4 9237-HQITU Female 0 No No 2 Yes \n",
"\n",
" MultipleLines InternetService OnlineSecurity ... DeviceProtection \\\n",
"0 No phone service DSL No ... No \n",
"1 No DSL Yes ... Yes \n",
"2 No DSL Yes ... No \n",
"3 No phone service DSL Yes ... Yes \n",
"4 No Fiber optic No ... No \n",
"\n",
" TechSupport StreamingTV StreamingMovies Contract PaperlessBilling \\\n",
"0 No No No Month-to-month Yes \n",
"1 No No No One year No \n",
"2 No No No Month-to-month Yes \n",
"3 Yes No No One year No \n",
"4 No No No Month-to-month Yes \n",
"\n",
" PaymentMethod MonthlyCharges TotalCharges Churn \n",
"0 Electronic check 29.85 29.85 No \n",
"1 Mailed check 56.95 1889.5 No \n",
"2 Mailed check 53.85 108.15 Yes \n",
"3 Bank transfer (automatic) 42.30 1840.75 No \n",
"4 Electronic check 70.70 151.65 Yes \n",
"\n",
"[5 rows x 21 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 68
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "3HRBmZYEuRnu",
"colab_type": "code",
"colab": {},
"outputId": "8f469c75-0c8f-43b5-ae40-fbc43db0dfef"
},
"source": [
"df.describe()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SeniorCitizen</th>\n",
" <th>tenure</th>\n",
" <th>MonthlyCharges</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>7043.000000</td>\n",
" <td>7043.000000</td>\n",
" <td>7043.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>0.162147</td>\n",
" <td>32.371149</td>\n",
" <td>64.761692</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.368612</td>\n",
" <td>24.559481</td>\n",
" <td>30.090047</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>18.250000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.000000</td>\n",
" <td>9.000000</td>\n",
" <td>35.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>0.000000</td>\n",
" <td>29.000000</td>\n",
" <td>70.350000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>0.000000</td>\n",
" <td>55.000000</td>\n",
" <td>89.850000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1.000000</td>\n",
" <td>72.000000</td>\n",
" <td>118.750000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SeniorCitizen tenure MonthlyCharges\n",
"count 7043.000000 7043.000000 7043.000000\n",
"mean 0.162147 32.371149 64.761692\n",
"std 0.368612 24.559481 30.090047\n",
"min 0.000000 0.000000 18.250000\n",
"25% 0.000000 9.000000 35.500000\n",
"50% 0.000000 29.000000 70.350000\n",
"75% 0.000000 55.000000 89.850000\n",
"max 1.000000 72.000000 118.750000"
]
},
"metadata": {
"tags": []
},
"execution_count": 70
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "2CM8tz3auRnx",
"colab_type": "code",
"colab": {},
"outputId": "8e16cc0e-e76d-4892-b899-0a2504ded4f0"
},
"source": [
"df.info()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 7043 entries, 0 to 7042\n",
"Data columns (total 21 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 customerID 7043 non-null object \n",
" 1 gender 7043 non-null object \n",
" 2 SeniorCitizen 7043 non-null int64 \n",
" 3 Partner 7043 non-null object \n",
" 4 Dependents 7043 non-null object \n",
" 5 tenure 7043 non-null int64 \n",
" 6 PhoneService 7043 non-null object \n",
" 7 MultipleLines 7043 non-null object \n",
" 8 InternetService 7043 non-null object \n",
" 9 OnlineSecurity 7043 non-null object \n",
" 10 OnlineBackup 7043 non-null object \n",
" 11 DeviceProtection 7043 non-null object \n",
" 12 TechSupport 7043 non-null object \n",
" 13 StreamingTV 7043 non-null object \n",
" 14 StreamingMovies 7043 non-null object \n",
" 15 Contract 7043 non-null object \n",
" 16 PaperlessBilling 7043 non-null object \n",
" 17 PaymentMethod 7043 non-null object \n",
" 18 MonthlyCharges 7043 non-null float64\n",
" 19 TotalCharges 7043 non-null object \n",
" 20 Churn 7043 non-null object \n",
"dtypes: float64(1), int64(2), object(18)\n",
"memory usage: 1.1+ MB\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "dsrQ6_QmuRn2",
"colab_type": "code",
"colab": {},
"outputId": "f02f8018-29a2-43be-e9b1-70c6656cd8e6"
},
"source": [
"df.isna().any()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"customerID False\n",
"gender False\n",
"SeniorCitizen False\n",
"Partner False\n",
"Dependents False\n",
"tenure False\n",
"PhoneService False\n",
"MultipleLines False\n",
"InternetService False\n",
"OnlineSecurity False\n",
"OnlineBackup False\n",
"DeviceProtection False\n",
"TechSupport False\n",
"StreamingTV False\n",
"StreamingMovies False\n",
"Contract False\n",
"PaperlessBilling False\n",
"PaymentMethod False\n",
"MonthlyCharges False\n",
"TotalCharges False\n",
"Churn False\n",
"dtype: bool"
]
},
"metadata": {
"tags": []
},
"execution_count": 69
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b7J9ExDYuRn7",
"colab_type": "text"
},
"source": [
"There are no NAN values"
]
},
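{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `TotalCharges` is stored as text, `isna()` cannot catch entries that are merely blank strings. Coercing the column to numeric is a quick extra check for such hidden missing values (our own addition, not part of the original activity)."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Coerce TotalCharges to numeric; any non-numeric entries become NaN\n",
"pd.to_numeric(df['TotalCharges'], errors='coerce').isna().sum()"
],
"execution_count": 0,
"outputs": []
},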
{
"cell_type": "code",
"metadata": {
"id": "k6PvC_xJuRn7",
"colab_type": "code",
"colab": {}
},
"source": [
"df['Churn'] = df['Churn'].replace(to_replace=['No', 'Yes'], value=[0, 1])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "stKvXmazuRn9",
"colab_type": "code",
"colab": {},
"outputId": "ed541088-86e8-4756-eac1-a2cd56721574"
},
"source": [
"df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>customerID</th>\n",
" <th>gender</th>\n",
" <th>SeniorCitizen</th>\n",
" <th>Partner</th>\n",
" <th>Dependents</th>\n",
" <th>tenure</th>\n",
" <th>PhoneService</th>\n",
" <th>MultipleLines</th>\n",
" <th>InternetService</th>\n",
" <th>OnlineSecurity</th>\n",
" <th>...</th>\n",
" <th>DeviceProtection</th>\n",
" <th>TechSupport</th>\n",
" <th>StreamingTV</th>\n",
" <th>StreamingMovies</th>\n",
" <th>Contract</th>\n",
" <th>PaperlessBilling</th>\n",
" <th>PaymentMethod</th>\n",
" <th>MonthlyCharges</th>\n",
" <th>TotalCharges</th>\n",
" <th>Churn</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7590-VHVEG</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>1</td>\n",
" <td>No</td>\n",
" <td>No phone service</td>\n",
" <td>DSL</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Electronic check</td>\n",
" <td>29.85</td>\n",
" <td>29.85</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5575-GNVDE</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>34</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>One year</td>\n",
" <td>No</td>\n",
" <td>Mailed check</td>\n",
" <td>56.95</td>\n",
" <td>1889.5</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3668-QPYBK</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Mailed check</td>\n",
" <td>53.85</td>\n",
" <td>108.15</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>7795-CFOCW</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>45</td>\n",
" <td>No</td>\n",
" <td>No phone service</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>One year</td>\n",
" <td>No</td>\n",
" <td>Bank transfer (automatic)</td>\n",
" <td>42.30</td>\n",
" <td>1840.75</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9237-HQITU</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Fiber optic</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Electronic check</td>\n",
" <td>70.70</td>\n",
" <td>151.65</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" customerID gender SeniorCitizen Partner Dependents tenure PhoneService \\\n",
"0 7590-VHVEG Female 0 Yes No 1 No \n",
"1 5575-GNVDE Male 0 No No 34 Yes \n",
"2 3668-QPYBK Male 0 No No 2 Yes \n",
"3 7795-CFOCW Male 0 No No 45 No \n",
"4 9237-HQITU Female 0 No No 2 Yes \n",
"\n",
" MultipleLines InternetService OnlineSecurity ... DeviceProtection \\\n",
"0 No phone service DSL No ... No \n",
"1 No DSL Yes ... Yes \n",
"2 No DSL Yes ... No \n",
"3 No phone service DSL Yes ... Yes \n",
"4 No Fiber optic No ... No \n",
"\n",
" TechSupport StreamingTV StreamingMovies Contract PaperlessBilling \\\n",
"0 No No No Month-to-month Yes \n",
"1 No No No One year No \n",
"2 No No No Month-to-month Yes \n",
"3 Yes No No One year No \n",
"4 No No No Month-to-month Yes \n",
"\n",
" PaymentMethod MonthlyCharges TotalCharges Churn \n",
"0 Electronic check 29.85 29.85 0 \n",
"1 Mailed check 56.95 1889.5 0 \n",
"2 Mailed check 53.85 108.15 1 \n",
"3 Bank transfer (automatic) 42.30 1840.75 0 \n",
"4 Electronic check 70.70 151.65 1 \n",
"\n",
"[5 rows x 21 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 73
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "5rpjRmfhuRn_",
"colab_type": "code",
"colab": {}
},
"source": [
"X = df.iloc[:,1:-1]\n",
"y = df.iloc[:, -1]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "_MpodgAluRoB",
"colab_type": "code",
"colab": {},
"outputId": "8e414bec-50e1-435e-85c6-56a50505c5fc"
},
"source": [
"X.shape, y.shape"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"((7043, 19), (7043,))"
]
},
"metadata": {
"tags": []
},
"execution_count": 78
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "gmqgc4jFuRoD",
"colab_type": "code",
"colab": {},
"outputId": "57c50fc9-0f03-4fca-cb76-cc4987545e38"
},
"source": [
"X = pd.get_dummies(X)\n",
"X.shape"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(7043, 6575)"
]
},
"metadata": {
"tags": []
},
"execution_count": 79
}
]
},
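{
"cell_type": "markdown",
"metadata": {},
"source": [
"6,575 columns is far more than 19 mostly categorical features would normally produce. The culprit is `TotalCharges`: because it is stored as text, `get_dummies` creates one column per unique charge value. A leaner alternative (our own suggestion, not the original workflow) converts it to numeric before encoding, as sketched below."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch: convert TotalCharges to numeric before one-hot encoding\n",
"X_alt = df.iloc[:, 1:-1].copy()\n",
"X_alt['TotalCharges'] = pd.to_numeric(X_alt['TotalCharges'], errors='coerce')\n",
"X_alt['TotalCharges'] = X_alt['TotalCharges'].fillna(X_alt['TotalCharges'].median())\n",
"X_alt = pd.get_dummies(X_alt)\n",
"X_alt.shape  # far fewer columns than 6575"
],
"execution_count": 0,
"outputs": []
},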
{
"cell_type": "code",
"metadata": {
"id": "t0YEpEbAuRoF",
"colab_type": "code",
"colab": {}
},
"source": [
"def clf_model_cv(model):\n",
" clf =model\n",
" scores = cross_val_score(clf, X, y)\n",
" print('Scores:', scores)\n",
" print('Mean Score: ',scores.mean())"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Ke5Do2qRuRoH",
"colab_type": "code",
"colab": {},
"outputId": "73ded69a-90a9-4ea9-f6f8-4b12d9262137"
},
"source": [
"clf_model_cv(LogisticRegression())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.80979418 0.81547197 0.78921221 0.80823864 0.79971591]\n",
"Mean Score: 0.8044865797793405\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "ULDvwqTPuRoL",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.metrics import classification_report\n",
"from sklearn.metrics import confusion_matrix\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test ,y_train, y_test = train_test_split(X, y, test_size = 0.25)\n",
"def confusion(model):\n",
" clf = model\n",
" clf.fit(X_train, y_train)\n",
" y_pred = clf.predict(X_test)\n",
" print('Confusion Matrix:', confusion_matrix(y_test, y_pred))\n",
" print('Classfication Report:', classification_report(y_test, y_pred))\n",
" \n",
" return clf"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "4DHGxG-fuRoN",
"colab_type": "code",
"colab": {},
"outputId": "7db12be3-7c96-4a65-d5ad-359b57eb5935"
},
"source": [
"confusion(AdaBoostClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[1143 131]\n",
" [ 220 267]]\n",
"Classfication Report: precision recall f1-score support\n",
"\n",
" 0 0.84 0.90 0.87 1274\n",
" 1 0.67 0.55 0.60 487\n",
"\n",
" accuracy 0.80 1761\n",
" macro avg 0.75 0.72 0.74 1761\n",
"weighted avg 0.79 0.80 0.79 1761\n",
"\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,\n",
" n_estimators=50, random_state=None)"
]
},
"metadata": {
"tags": []
},
"execution_count": 95
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": false,
"id": "RTWE2BaauRoP",
"colab_type": "code",
"colab": {},
"outputId": "f6be2d58-9ac6-4ab3-e789-426e36ffca37"
},
"source": [
"confusion(RandomForestClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[1174 100]\n",
" [ 255 232]]\n",
"Classfication Report: precision recall f1-score support\n",
"\n",
" 0 0.82 0.92 0.87 1274\n",
" 1 0.70 0.48 0.57 487\n",
"\n",
" accuracy 0.80 1761\n",
" macro avg 0.76 0.70 0.72 1761\n",
"weighted avg 0.79 0.80 0.79 1761\n",
"\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n",
" criterion='gini', max_depth=None, max_features='auto',\n",
" max_leaf_nodes=None, max_samples=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=100,\n",
" n_jobs=None, oob_score=False, random_state=None,\n",
" verbose=0, warm_start=False)"
]
},
"metadata": {
"tags": []
},
"execution_count": 98
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "WGRmWRDLuRoQ",
"colab_type": "code",
"colab": {},
"outputId": "fcae5a4d-4cd8-4830-89d4-07b542941413"
},
"source": [
"# We looked up AdaBoostClassifier() and discovered the n_estimators hyperparameter, similar to the n_estimators of Random Forests. We tried several out and came up with the following result for n_estimators=250:\n",
"confusion(AdaBoostClassifier(n_estimators=250))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[1151 123]\n",
" [ 225 262]]\n",
"Classfication Report: precision recall f1-score support\n",
"\n",
" 0 0.84 0.90 0.87 1274\n",
" 1 0.68 0.54 0.60 487\n",
"\n",
" accuracy 0.80 1761\n",
" macro avg 0.76 0.72 0.73 1761\n",
"weighted avg 0.79 0.80 0.79 1761\n",
"\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,\n",
" n_estimators=250, random_state=None)"
]
},
"metadata": {
"tags": []
},
"execution_count": 97
}
]
},
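{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trying values by hand works, but a grid search automates it. The sketch below (our own illustration; the grid values are arbitrary) cross-validates several `n_estimators` and `learning_rate` combinations on the training set."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch of a systematic hyperparameter search for AdaBoost\n",
"# (can be slow: the dummy-encoded feature matrix is very wide)\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"param_grid = {\n",
"    'n_estimators': [50, 100, 250],\n",
"    'learning_rate': [0.5, 1.0]\n",
"}\n",
"search = GridSearchCV(AdaBoostClassifier(), param_grid, cv=5)\n",
"search.fit(X_train, y_train)\n",
"print(search.best_params_, search.best_score_)"
],
"execution_count": 0,
"outputs": []
},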
{
"cell_type": "code",
"metadata": {
"id": "q_vmM7xeuRoS",
"colab_type": "code",
"colab": {}
},
"source": [
""
],
"execution_count": 0,
"outputs": []
}
]
}