{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"colab": {
"name": "Machine Learning Workshop.ipynb",
"provenance": [],
"include_colab_link": true
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/Sanket758/e43884f2cebc041f8d01e8e5ba36eed1/machine-learning-workshop.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y8Dgd0Z2uRkJ",
"colab_type": "text"
},
"source": [
"# Introduction to Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SWtodtRquRkK",
"colab_type": "text"
},
"source": [
"Linear regression is one of the most popular machine learning algorithms, and it is worth trying whenever the target column is a continuous value. The value of a home is generally considered continuous: there is technically no limit to how high the price of a home may be, and it can take any value between two numbers, even though prices are often rounded."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "N1HlB7L_uRkN",
"colab_type": "text"
},
"source": [
"### Using Linear Regression to Predict the Accuracy of the Hosing Prices"
]
},
{
"cell_type": "code",
"metadata": {
"id": "FtSugqREuRkN",
"colab_type": "code",
"colab": {}
},
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "feaZzMTzuRkT",
"colab_type": "code",
"colab": {},
"outputId": "ea763326-dfe5-4ebd-84cd-2f3035dd7be0"
},
"source": [
"# load data\n",
"housing_df = pd.read_csv('HousingData.csv')\n",
"housing_df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CRIM</th>\n",
" <th>ZN</th>\n",
" <th>INDUS</th>\n",
" <th>CHAS</th>\n",
" <th>NOX</th>\n",
" <th>RM</th>\n",
" <th>AGE</th>\n",
" <th>DIS</th>\n",
" <th>RAD</th>\n",
" <th>TAX</th>\n",
" <th>PTRATIO</th>\n",
" <th>B</th>\n",
" <th>LSTAT</th>\n",
" <th>MEDV</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.00632</td>\n",
" <td>18.0</td>\n",
" <td>2.31</td>\n",
" <td>0.0</td>\n",
" <td>0.538</td>\n",
" <td>6.575</td>\n",
" <td>65.2</td>\n",
" <td>4.0900</td>\n",
" <td>1</td>\n",
" <td>296</td>\n",
" <td>15.3</td>\n",
" <td>396.90</td>\n",
" <td>4.98</td>\n",
" <td>24.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.02731</td>\n",
" <td>0.0</td>\n",
" <td>7.07</td>\n",
" <td>0.0</td>\n",
" <td>0.469</td>\n",
" <td>6.421</td>\n",
" <td>78.9</td>\n",
" <td>4.9671</td>\n",
" <td>2</td>\n",
" <td>242</td>\n",
" <td>17.8</td>\n",
" <td>396.90</td>\n",
" <td>9.14</td>\n",
" <td>21.6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.02729</td>\n",
" <td>0.0</td>\n",
" <td>7.07</td>\n",
" <td>0.0</td>\n",
" <td>0.469</td>\n",
" <td>7.185</td>\n",
" <td>61.1</td>\n",
" <td>4.9671</td>\n",
" <td>2</td>\n",
" <td>242</td>\n",
" <td>17.8</td>\n",
" <td>392.83</td>\n",
" <td>4.03</td>\n",
" <td>34.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.03237</td>\n",
" <td>0.0</td>\n",
" <td>2.18</td>\n",
" <td>0.0</td>\n",
" <td>0.458</td>\n",
" <td>6.998</td>\n",
" <td>45.8</td>\n",
" <td>6.0622</td>\n",
" <td>3</td>\n",
" <td>222</td>\n",
" <td>18.7</td>\n",
" <td>394.63</td>\n",
" <td>2.94</td>\n",
" <td>33.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.06905</td>\n",
" <td>0.0</td>\n",
" <td>2.18</td>\n",
" <td>0.0</td>\n",
" <td>0.458</td>\n",
" <td>7.147</td>\n",
" <td>54.2</td>\n",
" <td>6.0622</td>\n",
" <td>3</td>\n",
" <td>222</td>\n",
" <td>18.7</td>\n",
" <td>396.90</td>\n",
" <td>NaN</td>\n",
" <td>36.2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n",
"0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 \n",
"1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 \n",
"2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 \n",
"3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 \n",
"4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3 222 18.7 \n",
"\n",
" B LSTAT MEDV \n",
"0 396.90 4.98 24.0 \n",
"1 396.90 9.14 21.6 \n",
"2 392.83 4.03 34.7 \n",
"3 394.63 2.94 33.4 \n",
"4 396.90 NaN 36.2 "
]
},
"metadata": {
"tags": []
},
"execution_count": 2
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "QMddYy2HuRkY",
"colab_type": "code",
"colab": {}
},
"source": [
"# drop null values\n",
"housing_df = housing_df.dropna()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "_J-bAlP_uRkc",
"colab_type": "code",
"colab": {}
},
"source": [
"# declare X and y\n",
"X = housing_df.iloc[:,:-1]\n",
"y = housing_df.iloc[:, -1]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "4zMDRCKNuRkg",
"colab_type": "code",
"colab": {}
},
"source": [
"#Create training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "CScPOtfLuRkk",
"colab_type": "code",
"colab": {}
},
"source": [
"#Create the regressor: reg\n",
"reg = LinearRegression()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Pu5S_37uuRko",
"colab_type": "code",
"colab": {},
"outputId": "c7c60140-8caf-4e96-e073-0f3fa194781b"
},
"source": [
"#Fit the regressor to the training data\n",
"reg.fit(X_train, y_train)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
]
},
"metadata": {
"tags": []
},
"execution_count": 7
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "bw59KWZLuRks",
"colab_type": "code",
"colab": {}
},
"source": [
"# Predict on the test data: y_pred\n",
"y_pred = reg.predict(X_test)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "ZoxRbdQQuRky",
"colab_type": "code",
"colab": {},
"outputId": "e327a7ac-8408-46b4-8054-79a76feea162"
},
"source": [
"# Compute and print RMSE\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Root Mean Squared Error: 3.331279959482406\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jEbl1xf5uRk1",
"colab_type": "text"
},
"source": [
"# Cross Validation\n",
"In cross-validation, also known as CV, the training data is split into five folds (any number will do, but five is standard). The machine learning algorithm is fit on one fold at a time and tested on the remaining data. The result is five different training and test sets that are all representative of the same data. The mean of the scores is usually taken as the accuracy of the model."
]
},
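{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of what happens under the hood (assuming the housing `X` defined above), sklearn's KFold shows the five train/test splits that cross-validation cycles through:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from sklearn.model_selection import KFold\n",
"\n",
"# Each iteration holds out one fold for testing and trains on the other four\n",
"kf = KFold(n_splits=5)\n",
"for fold, (train_idx, test_idx) in enumerate(kf.split(X)):\n",
"    print('Fold {}: train size = {}, test size = {}'.format(fold, len(train_idx), len(test_idx)))"
],
"execution_count": 0,
"outputs": []
},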
{
"cell_type": "markdown",
"metadata": {
"id": "2YB8bascuRk2",
"colab_type": "text"
},
"source": [
"### Using the cross_val_score Function to Get Accurate Results on the Dataset "
]
},
{
"cell_type": "code",
"metadata": {
"id": "v6syZ8wKuRk2",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.model_selection import cross_val_score"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "-FGjsnxKuRk6",
"colab_type": "code",
"colab": {}
},
"source": [
"# Define the regression_model_cv function, which takes a fitted model as one parameter. The k = 5 hyperparameter gives the number of folds.\n",
"def regression_model_cv(model, k=5):\n",
" scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=k)\n",
" rmse = np.sqrt(-scores)\n",
" print('Reg rmse:', rmse)\n",
" print('Reg mean:', rmse.mean ())"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BUGvAr4_uRk-",
"colab_type": "text"
},
"source": [
"In sklearn, the scoring options are sometimes limited. Since mean_squared_error is not an option for cross_val_score, we choose the neg_mean_squared_error. cross_val_score takes the highest value by default, and the highest negative mean squared error is 0."
]
},
{
"cell_type": "code",
"metadata": {
"id": "xg4ol_TOuRk_",
"colab_type": "code",
"colab": {},
"outputId": "f3f9fceb-5429-4a9e-e92d-76212a6b2614"
},
"source": [
"regression_model_cv(LinearRegression())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.26123843 4.42712448 5.66151114 8.09493087 5.24453989]\n",
"Reg mean: 5.337868962878373\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "jKDzeadquRlC",
"colab_type": "code",
"colab": {},
"outputId": "fe3d8168-eec6-42c8-a5b7-484c5ba56620"
},
"source": [
"#Use the regression_model_cv function on the LinearRegression() model with 3 folds and then 6 folds, as shown in the following code snippet, for 3 folds:\n",
"regression_model_cv(LinearRegression(), k=3)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [ 3.72504914 6.01655701 23.20863933]\n",
"Reg mean: 10.983415161090685\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "OFZtNAA7uRlE",
"colab_type": "code",
"colab": {},
"outputId": "3b63d29e-115d-4617-c9ca-b5cecd08de3b"
},
"source": [
"# Now, test the values for 6 folds\n",
"regression_model_cv(LinearRegression(), k=6)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.23879491 3.97041949 5.58329663 3.92861033 9.88399671 3.91442679]\n",
"Reg mean: 5.08659081080109\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "66cprzOKuRlH",
"colab_type": "text"
},
"source": [
"# Regularization: Ridge and Lasso"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fACqT5XvuRlI",
"colab_type": "text"
},
"source": [
"**Regularization** is an important concept in machine learning; it's used to counteract overfitting. In the world of big data, it's easy to overfit data to the training set. When this happens, the model will often perform badly on the test set as indicated by mean_squared_error, or some other error.\n",
"\n",
"**Ridge** is a simple alternative to linear regression, designed to counteract overfitting. Ridge includes an L2 penalty term (L2 is based on Euclidean Distance) that shrinks the linear coefficients based on their size. The coefficients are the weights, numbers that determine how influential each column is on the output. Larger weights carry greater penalties in Ridge.\n",
"\n",
"**Lasso** is another regularized alternative to linear regression. Lasso adds a penalty equal to the absolute value of the magnitude of coefficients. This L1 regularization (L1 is taxicab distance) can eliminate some columns and result in a model that is sparse by comparison."
]
},
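{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before comparing cross-validation scores, here is a minimal sketch (assuming the housing `X` and `y` defined earlier) of what the penalties actually do: both shrink the coefficients relative to plain linear regression, and Lasso's L1 penalty can drive some of them all the way to zero:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from sklearn.linear_model import LinearRegression, Ridge, Lasso\n",
"\n",
"# Regularization shrinks the weights: compare total coefficient magnitude\n",
"for name, model in [('LinearRegression', LinearRegression()),\n",
"                    ('Ridge', Ridge()), ('Lasso', Lasso())]:\n",
"    model.fit(X, y)\n",
"    print('{}: sum of |coefficients| = {:.2f}'.format(name, np.abs(model.coef_).sum()))"
],
"execution_count": 0,
"outputs": []
},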
{
"cell_type": "markdown",
"metadata": {
"id": "7vg-EB6QuRlJ",
"colab_type": "text"
},
"source": [
"Let's look at an example to check how Ridge and Lasso perform on our Boston Housing dataset."
]
},
{
"cell_type": "code",
"metadata": {
"id": "o5gnEL8nuRlJ",
"colab_type": "code",
"colab": {},
"outputId": "715d715f-5a90-4876-8e94-8d6f02e80f16"
},
"source": [
"#We begin by setting Ridge() as a parameter for regression_model_cv\n",
"from sklearn.linear_model import Ridge\n",
"regression_model_cv(Ridge())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.17202127 4.54972372 5.36604368 8.03715216 5.03988501]\n",
"Reg mean: 5.2329651662517715\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5b_-_hRouRlO",
"colab_type": "text"
},
"source": [
"It's not surprising that Ridge has a slightly better score than linear regression. This is because both algorithms use Euclidean distance and the linear regression model is overfitting the data by a slight amount. "
]
},
{
"cell_type": "code",
"metadata": {
"id": "v3VdVCdzuRlP",
"colab_type": "code",
"colab": {},
"outputId": "b2fa550c-f0df-4f60-dc6d-4a37852ef696"
},
"source": [
"# Now, set Lasso() as the parameter for regression_model_cv:\n",
"from sklearn.linear_model import Lasso\n",
"regression_model_cv(Lasso())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.52318747 5.70083491 7.82318757 6.9878025 3.97229348]\n",
"Reg mean: 5.60146118538429\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cuqpngAHuRlT",
"colab_type": "text"
},
"source": [
"Whenever you're trying LinearRegression(), it's always worth trying Lasso and Ridge as well, since overfitting the data is common, and they only actually take a few lines of code to test. Lasso does not perform as well here because the L1 distance metric, taxicab distance, was not used in our model."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Qh37SX6XuRlU",
"colab_type": "text"
},
"source": [
"# K-Nearest Neighbors, Decision Trees, and Random Forests"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "90yfFvMzuRlU",
"colab_type": "text"
},
"source": [
"Are there other machine learning algorithms, besides LinearRegression(), that is suitable for the Boston Housing dataset? Absolutely. There are many regressors in the scikit-learn library that may be used. Regressors are generally considered a class of machine learning algorithms that are suitable for continuous target values. In addition to Linear Regression, Ridge, and Lasso, we can try K-Nearest Neighbors, Decision Trees, and Random Forests. These models perform well on a wide range of datasets. Let's try them out and analyze them individually."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yvw6Y70SuRlV",
"colab_type": "text"
},
"source": [
"# K-Nearest Neighbors"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6DJ-dbV4uRlV",
"colab_type": "text"
},
"source": [
"The idea behind k-nearest neighbors (KNN) is straightforward. When choosing the output of a row with an unknown label, the prediction is the same as the output of its k-nearest neighbors, where k may be any whole number.\n",
"\n",
"For instance, let's say that k=3. Given an unknown label, we take n columns for this row and place them in n-dimensional space. Then we look for the three closest points. These points already have labels. We assume the majority label for our new point.\n",
"\n",
"KNN is commonly used for classification since classification is based on grouping values, but it can be applied to regression as well. When determining the value of a home, for instance, in our Boston Housing dataset, it makes sense to compare the values of homes in a similar location, with a similar number of bedrooms, a similar amount of square footage, and so on.\n",
"\n",
"You can always choose the number of neighbors for the algorithm and adjust it accordingly. The number of neighbors denoted here is k, which is also called a hyperparameter. In machine learning, the model parameters are derived during training, whereas the hyperparameters are chosen in advance.\n",
"\n",
"Fine-tuning hyperparameters is an essential task to master when building machine learning models. Learning the ins and outs of hyperparameter tuning takes time, practice, and experimentation."
]
},
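{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea concrete, here is a minimal from-scratch sketch of a k=3 regression prediction on a tiny made-up dataset: compute the distance to every known point, find the three closest, and average their targets:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Tiny made-up dataset: six points with two features each, plus their targets\n",
"points = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]])\n",
"targets = np.array([10.0, 12.0, 11.0, 40.0, 42.0, 41.0])\n",
"new_point = np.array([1.5, 1.5])\n",
"\n",
"# Euclidean distance from the new point to every known point\n",
"distances = np.sqrt(((points - new_point) ** 2).sum(axis=1))\n",
"\n",
"# Indices of the k=3 nearest neighbors; the prediction is the mean of their targets\n",
"nearest = distances.argsort()[:3]\n",
"print('Predicted value:', targets[nearest].mean())"
],
"execution_count": 0,
"outputs": []
},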
{
"cell_type": "markdown",
"metadata": {
"id": "59k2Fev9uRlW",
"colab_type": "text"
},
"source": [
"### Using K-Nearest Neighbors "
]
},
{
"cell_type": "code",
"metadata": {
"id": "fgS11cU6uRlW",
"colab_type": "code",
"colab": {},
"outputId": "3a04f886-6f3a-4f10-b120-e6b935e5ca0f"
},
"source": [
"from sklearn.neighbors import KNeighborsRegressor\n",
"regression_model_cv(KNeighborsRegressor())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [ 8.24568226 8.81322798 10.58043836 8.85643441 5.98100069]\n",
"Reg mean: 8.495356738515685\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fVaW6jBSuRlZ",
"colab_type": "text"
},
"source": [
"We can change the number of neighbors to see if we can get better results. The default number of neighbors is 5. Let's change the number of neighbors to 4, 7, and 10."
]
},
{
"cell_type": "code",
"metadata": {
"id": "XkQYYkCquRlZ",
"colab_type": "code",
"colab": {},
"outputId": "4a022ea2-3bd5-4a21-f798-02ee93770bcc"
},
"source": [
"regression_model_cv(KNeighborsRegressor(n_neighbors=4))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [ 8.44659788 8.99814547 10.97170231 8.86647969 5.72114135]\n",
"Reg mean: 8.600813339223432\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "yueL2qBeuRlc",
"colab_type": "code",
"colab": {},
"outputId": "8c1e12b2-2268-4c98-f8b8-5b545ecdbc01"
},
"source": [
"regression_model_cv(KNeighborsRegressor(n_neighbors=7))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [ 7.99710601 8.68309183 10.66332898 8.90261573 5.51032355]\n",
"Reg mean: 8.351293217401393\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "3LXuO7MzuRlh",
"colab_type": "code",
"colab": {},
"outputId": "e6c59880-730a-4719-d2fb-68351aa04109"
},
"source": [
"regression_model_cv(KNeighborsRegressor(n_neighbors=10))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [ 7.47549287 8.62914556 10.69543822 8.91330686 6.52982222]\n",
"Reg mean: 8.448641147609868\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eyeq-raLuRll",
"colab_type": "text"
},
"source": [
"Scikit-learn provides a nice option to check a wide range of hyperparameters, which is GridSearchCV. The idea behind **GridSearchCV** is to use cross-validation to check all possible values in a grid. The value in the grid that gives the best result is then accepted as a hyperparameter."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lUjeEdmruRll",
"colab_type": "text"
},
"source": [
"# K-Nearest Neighbors with GridSearchCV to Find the Optimal Number of Neighbors"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ykJ-KjnSuRlm",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.model_selection import GridSearchCV"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "QOzpdWgTuRlq",
"colab_type": "code",
"colab": {}
},
"source": [
"# Now, choose the grid. The grid is the range of numbers – in this case, neighbors – that will be checked. Set up a hyperparameter grid for between 1 and 20 neighbors:\n",
"neighbors = np.linspace(1,20,20)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Lawyv8GVuRls",
"colab_type": "text"
},
"source": [
"We achieve this with np.linspace(1, 20, 20), where the 1 is the first number, the first 20 is the last number, and the second 20 in the brackets is the number of intervals to count."
]
},
{
"cell_type": "code",
"metadata": {
"id": "GWdCMWH8uRlt",
"colab_type": "code",
"colab": {}
},
"source": [
"# Convert floats to int (required by knn):\n",
"k = neighbors.astype(int)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "BewLKhuuuRlv",
"colab_type": "code",
"colab": {}
},
"source": [
"# Now, place the grid in a dictionary\n",
"param_grid = { 'n_neighbors': k }"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "juHIQ-vyuRly",
"colab_type": "code",
"colab": {}
},
"source": [
"# Build the model for each neighbor:\n",
"knn = KNeighborsRegressor()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "VHZ4DV3FuRl1",
"colab_type": "code",
"colab": {}
},
"source": [
"# Instantiate the GridSearchCV object – knn_tuned:\n",
"knn_tuned = GridSearchCV(knn, param_grid, cv=5, scoring='neg_mean_squared_error')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "xz7YYbdouRl5",
"colab_type": "code",
"colab": {},
"outputId": "d3127ea2-be50-4678-b5e2-b7b116c26997"
},
"source": [
"#Fit knn_tuned to the data using .fit:\n",
"knn_tuned.fit(X, y)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"GridSearchCV(cv=5, error_score=nan,\n",
" estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30,\n",
" metric='minkowski',\n",
" metric_params=None, n_jobs=None,\n",
" n_neighbors=5, p=2,\n",
" weights='uniform'),\n",
" iid='deprecated', n_jobs=None,\n",
" param_grid={'n_neighbors': array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,\n",
" 18, 19, 20])},\n",
" pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n",
" scoring='neg_mean_squared_error', verbose=0)"
]
},
"metadata": {
"tags": []
},
"execution_count": 27
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "ewZwAmKBuRl7",
"colab_type": "code",
"colab": {},
"outputId": "6f6cb020-92e0-4982-8db4-5a8fc17f8fc3"
},
"source": [
"# Finally, you print the best parameter results, \n",
"k = knn_tuned.best_params_\n",
"print('Best n_neighbors: {}'.format(k))\n",
"score = knn_tuned.best_score_\n",
"rsm = np.sqrt(-score)\n",
"print('Best score: {}'.format(rsm))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Best n_neighbors: {'n_neighbors': 7}\n",
"Best score: 8.516767055977628\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H3KMCZUXuRl-",
"colab_type": "text"
},
"source": [
"# Decision Trees and Random Forests"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-c6RTHG7uRl_",
"colab_type": "text"
},
"source": [
"Decision Trees are very good machine learning algorithms, but they are prone to overfitting. A random forest is an ensemble of decision trees. Random forests consistently outperform decision trees because their predictions generalize to data much better. A random forest may consist of hundreds of decision trees.\n",
"\n",
"A random forest is a great machine-learning algorithm to try on almost any dataset. Random forests work well with both regression and classification, and they often perform well out of the box."
]
},
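{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the ensemble idea (assuming the housing `X` and `y` defined earlier), we can bag a few decision trees by hand: train each tree on a bootstrap sample of the rows and average their predictions:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"rng = np.random.RandomState(0)\n",
"all_preds = []\n",
"for i in range(10):\n",
"    # Bootstrap sample: draw rows with replacement\n",
"    sample = rng.choice(len(X), size=len(X), replace=True)\n",
"    tree_reg = DecisionTreeRegressor(random_state=i)\n",
"    tree_reg.fit(X.iloc[sample], y.iloc[sample])\n",
"    all_preds.append(tree_reg.predict(X))\n",
"\n",
"# The ensemble's prediction is the average over all trees\n",
"forest_pred = np.mean(all_preds, axis=0)\n",
"print('First five averaged predictions:', forest_pred[:5])"
],
"execution_count": 0,
"outputs": []
},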
{
"cell_type": "markdown",
"metadata": {
"id": "tBt1zLe1uRl_",
"colab_type": "text"
},
"source": [
"### Decision Tree"
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "8EfpRcSouRmA",
"colab_type": "code",
"colab": {},
"outputId": "a1786eae-6514-4cf4-9ed1-e84223d3a96a"
},
"source": [
"from sklearn import tree\n",
"regression_model_cv(tree.DecisionTreeRegressor())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.8171001 5.74304202 8.26752837 6.79114278 5.57497844]\n",
"Reg mean: 6.038758341497116\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "v0YesAAeuRmE",
"colab_type": "text"
},
"source": [
"### Random Forest"
]
},
{
"cell_type": "code",
"metadata": {
"id": "2YhoM3YWuRmE",
"colab_type": "code",
"colab": {},
"outputId": "f3355959-d6d0-4f19-a648-8226d4203dc4"
},
"source": [
"# Use RandomForestRegressor() as the input for regression_model_cv\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"regression_model_cv(RandomForestRegressor())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.21681204 3.58531755 4.46366413 6.48583246 4.02492014]\n",
"Reg mean: 4.355309264608133\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qs702qS2uRmH",
"colab_type": "text"
},
"source": [
"# Random Forest Hyperparameters\n",
"\n",
"Random forests have a lot of hyperparameters. Instead of going over them all, we will highlight the most important ones:\n",
"\n",
"**n_jobs(default=None)**: The number of jobs has to do with internal processing. None means 1. It's ideal to set n_jobs = -1 to permit the use of all processors. Although this does not improve the accuracy of the model, it does improve the speed.\n",
"\n",
"**n_estimators(default=10)**: The number of trees in the forest. The more trees, the better. The more trees, the more RAM is required. It's worth increasing this number until the algorithm moves too slowly. Although 1,000,000 trees may give better results than 1,000, the gain might be small enough to be negligible. A good starting point is 100, and 500 if time permits.\n",
"\n",
"**max_depth(default=None)**: The max depth of the trees in the forest. The deeper the trees, the more information is captured about the data, but the more prone the trees are to overfitting. When set to the default max_depth of None, there are no limitations, and each tree goes as deep as necessary. The max depth may be reduced to a smaller number of branches.\n",
"\n",
"**min_samples_split(default=2)**: This is the minimum number of samples required for a new branch or split to occur. This number can be increased to constrain the trees as they require more samples to make a decision.\n",
"\n",
"**min_samples_leaf(default=1)**: This is the same as min_samples_split, except it's the minimum number of samples at the leaves or the base of the tree. By increasing this number, the branch will stop splitting when it reaches this parameter.\n",
"\n",
"**max_features(default=\"auto\")**: The number of features to consider when looking for the best split. The default for regression is to consider the total number of columns. For classification random forests, sqrt is recommended."
]
},
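{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of where each hyperparameter goes (the values below are arbitrary, not tuned), they are all passed to the constructor:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Arbitrary example values; the next exercises tune these properly\n",
"rf = RandomForestRegressor(n_jobs=-1, n_estimators=100, max_depth=20,\n",
"                           min_samples_split=4, min_samples_leaf=2,\n",
"                           max_features='auto')\n",
"regression_model_cv(rf)"
],
"execution_count": 0,
"outputs": []
},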
{
"cell_type": "markdown",
"metadata": {
"id": "QG1HsljhuRmH",
"colab_type": "text"
},
"source": [
"### Random Forest Tuned to Improve the Prediction on Our Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "gAIgUcQHuRmH",
"colab_type": "code",
"colab": {},
"outputId": "02c3d78b-fad1-4f25-9f5e-097877468540"
},
"source": [
"# Set n_jobs = -1 and n_estimators=100 for RandomForestRegressor as the input of regression_model_cv. We can always use n_jobs to speed up the algorithm, and we can increase n_estimators to achieve better results:\n",
"regression_model_cv(RandomForestRegressor(n_jobs=-1, n_estimators=100))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.26567684 3.70417176 5.00312032 6.47581253 4.15593728]\n",
"Reg mean: 4.520943745411145\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "faIMeX5buRmK",
"colab_type": "text"
},
"source": [
"Sklearn provides RandomizedSearchCV to check a wide range of hyperparameters. Instead of exhaustively going through a list, RandomizedSearchCV will check a set amount of random combinations and return the best results."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ixbrt9wjuRmL",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.model_selection import RandomizedSearchCV"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "T91Cg5esuRmO",
"colab_type": "code",
"colab": {}
},
"source": [
"# Set up the hyperparameter grid using max_depth\n",
"param_grid = { 'max_depth': [None, 10, 30 ,50,70,100,200,400],\n",
" 'min_samples_split': [2,3,4,5],\n",
" 'min_samples_leaf':[1,2,3],\n",
" 'max_features': ['auto','sqrt']}"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "hk7BDoo1uRmR",
"colab_type": "code",
"colab": {}
},
"source": [
"reg = RandomForestRegressor(n_jobs = -1)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "6GIqIdIKuRmV",
"colab_type": "code",
"colab": {}
},
"source": [
"reg_tuned = RandomizedSearchCV(reg, param_grid, cv=5, scoring='neg_mean_squared_error')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "OHRbKB4GuRmY",
"colab_type": "code",
"colab": {},
"outputId": "52a709e4-ff9d-4341-dde6-1b5aa4838da2"
},
"source": [
"reg_tuned.fit(X,y)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"RandomizedSearchCV(cv=5, error_score=nan,\n",
" estimator=RandomForestRegressor(bootstrap=True,\n",
" ccp_alpha=0.0,\n",
" criterion='mse',\n",
" max_depth=None,\n",
" max_features='auto',\n",
" max_leaf_nodes=None,\n",
" max_samples=None,\n",
" min_impurity_decrease=0.0,\n",
" min_impurity_split=None,\n",
" min_samples_leaf=1,\n",
" min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0,\n",
" n_estimators=100, n_jobs=-1,\n",
" oob_score=False,\n",
" random_state=None, verbose=0,\n",
" warm_start=False),\n",
" iid='deprecated', n_iter=10, n_jobs=None,\n",
" param_distributions={'max_depth': [None, 10, 30, 50, 70, 100,\n",
" 200, 400],\n",
" 'max_features': ['auto', 'sqrt'],\n",
" 'min_samples_leaf': [1, 2, 3],\n",
" 'min_samples_split': [2, 3, 4, 5]},\n",
" pre_dispatch='2*n_jobs', random_state=None, refit=True,\n",
" return_train_score=False, scoring='neg_mean_squared_error',\n",
" verbose=0)"
]
},
"metadata": {
"tags": []
},
"execution_count": 36
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "g2r3XbWiuRmb",
"colab_type": "code",
"colab": {},
"outputId": "3fe5770e-2c22-4bce-fe16-885dd39502ac"
},
"source": [
"p = reg_tuned.best_params_\n",
"print('Best n_neighbors:{}'.format(p))\n",
"score = reg_tuned.best_score_\n",
"rsm = np.sqrt(-score)\n",
"print('Best score: {}'.format(rsm))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Best n_neighbors:{'min_samples_split': 3, 'min_samples_leaf': 3, 'max_features': 'auto', 'max_depth': 50}\n",
"Best score: 4.595081290129918\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "9cULuVUNuRmd",
"colab_type": "code",
"colab": {},
"outputId": "819d0a92-777d-4e0c-c76f-a22aef79071b"
},
"source": [
"# Now, run a random forest regressor with n_jobs = -1 and n_estimators = 500:\n",
"# Setup the hyperparameter grid\n",
"regression_model_cv(RandomForestRegressor(n_jobs=-1, n_estimators=500))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.19922811 3.7531576 4.81348515 6.50186152 3.92958863]\n",
"Reg mean: 4.439464203351062\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CUY6O8BBuRmh",
"colab_type": "text"
},
"source": [
"Hyperparameters are a primary key to building excellent machine learning models. Anyone with basic machine learning training can build machine learning models using default hyperparameters. Using GridSearchCV and RandomizedSearchCV to fine-tune hyperparameters to create more efficient models distinguishes advanced users from beginners."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9JjAHdp8uRmi",
"colab_type": "text"
},
"source": [
"# Classification Models\n",
"\n",
"The Boston Housing dataset was great for regression because the target column took on continuous values without limit. There are many cases when the target column takes on one or two values, such as TRUE or FALSE, or possibly a grouping of three or more values, such as RED, BLUE, or GREEN. When the target column may be split into distinct categories, the group of machine learning models that you should try are referred to as classification."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gfCLxLeXuRmi",
"colab_type": "text"
},
"source": [
"To make things interesting, let's load a new dataset used to detect pulsar stars in outer space."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ye0OYHzruRmj",
"colab_type": "code",
"colab": {},
"outputId": "a4ae741c-ca1c-4d67-ae70-bc835a1a9a76"
},
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"df = pd.read_csv('HTRU_2.csv')\n",
"df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>140.5625</th>\n",
" <th>55.68378214</th>\n",
" <th>-0.234571412</th>\n",
" <th>-0.699648398</th>\n",
" <th>3.199832776</th>\n",
" <th>19.11042633</th>\n",
" <th>7.975531794</th>\n",
" <th>74.24222492</th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>102.507812</td>\n",
" <td>58.882430</td>\n",
" <td>0.465318</td>\n",
" <td>-0.515088</td>\n",
" <td>1.677258</td>\n",
" <td>14.860146</td>\n",
" <td>10.576487</td>\n",
" <td>127.393580</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>103.015625</td>\n",
" <td>39.341649</td>\n",
" <td>0.323328</td>\n",
" <td>1.051164</td>\n",
" <td>3.121237</td>\n",
" <td>21.744669</td>\n",
" <td>7.735822</td>\n",
" <td>63.171909</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>136.750000</td>\n",
" <td>57.178449</td>\n",
" <td>-0.068415</td>\n",
" <td>-0.636238</td>\n",
" <td>3.642977</td>\n",
" <td>20.959280</td>\n",
" <td>6.896499</td>\n",
" <td>53.593661</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>88.726562</td>\n",
" <td>40.672225</td>\n",
" <td>0.600866</td>\n",
" <td>1.123492</td>\n",
" <td>1.178930</td>\n",
" <td>11.468720</td>\n",
" <td>14.269573</td>\n",
" <td>252.567306</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>93.570312</td>\n",
" <td>46.698114</td>\n",
" <td>0.531905</td>\n",
" <td>0.416721</td>\n",
" <td>1.636288</td>\n",
" <td>14.545074</td>\n",
" <td>10.621748</td>\n",
" <td>131.394004</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 140.5625 55.68378214 -0.234571412 -0.699648398 3.199832776 \\\n",
"0 102.507812 58.882430 0.465318 -0.515088 1.677258 \n",
"1 103.015625 39.341649 0.323328 1.051164 3.121237 \n",
"2 136.750000 57.178449 -0.068415 -0.636238 3.642977 \n",
"3 88.726562 40.672225 0.600866 1.123492 1.178930 \n",
"4 93.570312 46.698114 0.531905 0.416721 1.636288 \n",
"\n",
" 19.11042633 7.975531794 74.24222492 0 \n",
"0 14.860146 10.576487 127.393580 0 \n",
"1 21.744669 7.735822 63.171909 0 \n",
"2 20.959280 6.896499 53.593661 0 \n",
"3 11.468720 14.269573 252.567306 0 \n",
"4 14.545074 10.621748 131.394004 0 "
]
},
"metadata": {
"tags": []
},
"execution_count": 39
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vbjM2Y6puRml",
"colab_type": "text"
},
"source": [
"Looks interesting, and problematic. Notice that the column headers appear to be another row. It's impossible to analyze data without knowing what the columns are supposed to be, right?\n",
"\n",
"Note that the last column is all 0's in the DataFrame. This suggests that this is the Class column, which is our target column. When detecting the presence of something – in this case, pulsar stars – it's common to use a 1 for positive identification, and a 0 for a negative identification.\n",
"\n",
"Since Class is last in the list, let's assume that the columns are given in the correct order presented in the Attribute Information list. We can also assume that losing the current column headers, a negative identification among 17,898 rows is meaningless. The easiest way forward is simply to change the column headers to match the attribute list."
]
},
{
"cell_type": "code",
"metadata": {
"id": "e8WtkTbKuRmm",
"colab_type": "code",
"colab": {},
"outputId": "feb83ac2-cf34-4d51-8997-c61dacb1999b"
},
"source": [
"# Now, change column headers to match the official list and print the first five rows, as shown in the following code snippet:\n",
"df.columns = [['Mean of integrated profile', 'Standard deviation of integrated profile', \n",
" 'Excess kurtosis of integrated profile', 'Skewness of integrated profile',\n",
" 'Mean of DM-SNR curve', 'Standard deviation of DM-SNR curve',\n",
" 'Excess kurtosis of DM-SNR curve', 'Skewness of DM-SNR curve', 'Class' ]]\n",
"\n",
"df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th>Mean of integrated profile</th>\n",
" <th>Standard deviation of integrated profile</th>\n",
" <th>Excess kurtosis of integrated profile</th>\n",
" <th>Skewness of integrated profile</th>\n",
" <th>Mean of DM-SNR curve</th>\n",
" <th>Standard deviation of DM-SNR curve</th>\n",
" <th>Excess kurtosis of DM-SNR curve</th>\n",
" <th>Skewness of DM-SNR curve</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>102.507812</td>\n",
" <td>58.882430</td>\n",
" <td>0.465318</td>\n",
" <td>-0.515088</td>\n",
" <td>1.677258</td>\n",
" <td>14.860146</td>\n",
" <td>10.576487</td>\n",
" <td>127.393580</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>103.015625</td>\n",
" <td>39.341649</td>\n",
" <td>0.323328</td>\n",
" <td>1.051164</td>\n",
" <td>3.121237</td>\n",
" <td>21.744669</td>\n",
" <td>7.735822</td>\n",
" <td>63.171909</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>136.750000</td>\n",
" <td>57.178449</td>\n",
" <td>-0.068415</td>\n",
" <td>-0.636238</td>\n",
" <td>3.642977</td>\n",
" <td>20.959280</td>\n",
" <td>6.896499</td>\n",
" <td>53.593661</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>88.726562</td>\n",
" <td>40.672225</td>\n",
" <td>0.600866</td>\n",
" <td>1.123492</td>\n",
" <td>1.178930</td>\n",
" <td>11.468720</td>\n",
" <td>14.269573</td>\n",
" <td>252.567306</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>93.570312</td>\n",
" <td>46.698114</td>\n",
" <td>0.531905</td>\n",
" <td>0.416721</td>\n",
" <td>1.636288</td>\n",
" <td>14.545074</td>\n",
" <td>10.621748</td>\n",
" <td>131.394004</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Mean of integrated profile Standard deviation of integrated profile \\\n",
"0 102.507812 58.882430 \n",
"1 103.015625 39.341649 \n",
"2 136.750000 57.178449 \n",
"3 88.726562 40.672225 \n",
"4 93.570312 46.698114 \n",
"\n",
" Excess kurtosis of integrated profile Skewness of integrated profile \\\n",
"0 0.465318 -0.515088 \n",
"1 0.323328 1.051164 \n",
"2 -0.068415 -0.636238 \n",
"3 0.600866 1.123492 \n",
"4 0.531905 0.416721 \n",
"\n",
" Mean of DM-SNR curve Standard deviation of DM-SNR curve \\\n",
"0 1.677258 14.860146 \n",
"1 3.121237 21.744669 \n",
"2 3.642977 20.959280 \n",
"3 1.178930 11.468720 \n",
"4 1.636288 14.545074 \n",
"\n",
" Excess kurtosis of DM-SNR curve Skewness of DM-SNR curve Class \n",
"0 10.576487 127.393580 0 \n",
"1 7.735822 63.171909 0 \n",
"2 6.896499 53.593661 0 \n",
"3 14.269573 252.567306 0 \n",
"4 10.621748 131.394004 0 "
]
},
"metadata": {
"tags": []
},
"execution_count": 40
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "GnQ_ueUfuRmo",
"colab_type": "code",
"colab": {},
"outputId": "3d29d2c8-e3dd-4584-b068-cd292de731aa"
},
"source": [
"df.info()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 17897 entries, 0 to 17896\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 (Mean of integrated profile,) 17897 non-null float64\n",
" 1 (Standard deviation of integrated profile,) 17897 non-null float64\n",
" 2 (Excess kurtosis of integrated profile,) 17897 non-null float64\n",
" 3 (Skewness of integrated profile,) 17897 non-null float64\n",
" 4 (Mean of DM-SNR curve,) 17897 non-null float64\n",
" 5 (Standard deviation of DM-SNR curve,) 17897 non-null float64\n",
" 6 (Excess kurtosis of DM-SNR curve,) 17897 non-null float64\n",
" 7 (Skewness of DM-SNR curve,) 17897 non-null float64\n",
" 8 (Class,) 17897 non-null int64 \n",
"dtypes: float64(8), int64(1)\n",
"memory usage: 1.2 MB\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "gwGElbTUuRmr",
"colab_type": "code",
"colab": {},
"outputId": "55cd6702-ed15-4e83-8b39-6197d82258dc"
},
"source": [
"df.shape"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(17897, 9)"
]
},
"metadata": {
"tags": []
},
"execution_count": 42
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "TcAfcOVmuRmu",
"colab_type": "code",
"colab": {},
"outputId": "fafcd332-a3a7-4022-e2c1-10ab6a60557b"
},
"source": [
"df.isnull().any()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Mean of integrated profile False\n",
"Standard deviation of integrated profile False\n",
"Excess kurtosis of integrated profile False\n",
"Skewness of integrated profile False\n",
"Mean of DM-SNR curve False\n",
"Standard deviation of DM-SNR curve False\n",
"Excess kurtosis of DM-SNR curve False\n",
"Skewness of DM-SNR curve False\n",
"Class False\n",
"dtype: bool"
]
},
"metadata": {
"tags": []
},
"execution_count": 43
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_g7h5DKxuRmw",
"colab_type": "text"
},
"source": [
"We know that there are no null values. If there were null values, we would need to eliminate the rows or fill them in by taking the mean, the median, the mode, or another value from the columns."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "F4y2uqaXuRmw",
"colab_type": "text"
},
"source": [
"When it comes to preparing data for machine learning, it's essential to have clean, numerical data with no null values. Further data analysis is often warranted, depending upon the goal at hand. If the goal is simply to try out some models and check them for accuracy, it's fine to go ahead. If the goal is to uncover deep insights about the data, further statistical analysis, as introduced in the previous chapter, is always warranted."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uVPn6xhvuRmw",
"colab_type": "text"
},
"source": [
"# Logistic Regression\n",
"\n",
"When it comes to datasets that classify points, logistic regression is one of the most popular and successful machine learning algorithms. Logistic regression utilizes the sigmoid function to determine whether points should approach one value or the other. As the following diagram indicates, it's a good idea to classify the target values as 0 and 1 when utilizing logistic regression. In the pulsar dataset, the values are already classified as 0s and 1s. If the dataset was labeled as Red and Blue, converting them in advance to 0 and 1 would be essential"
]
},
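{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the idea, the sigmoid function maps any real number to a value between 0 and 1, and a 0.5 threshold turns that value into a class:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sigmoid squashes any real number into the interval (0, 1)\n",
"def sigmoid(z):\n",
"    return 1 / (1 + np.exp(-z))\n",
"\n",
"for z in [-4, -1, 0, 1, 4]:\n",
"    p = sigmoid(z)\n",
"    # Outputs above 0.5 are classified as 1, the rest as 0\n",
"    print('z = {:>2}, sigmoid = {:.3f}, class = {}'.format(z, p, int(p > 0.5)))"
],
"execution_count": 0,
"outputs": []
},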
{
"cell_type": "markdown",
"metadata": {
"id": "XBgypeSNuRmx",
"colab_type": "text"
},
"source": [
"### Using Logistic Regression to Predict Data Accuracy"
]
},
{
"cell_type": "code",
"metadata": {
"id": "QOvhwA6ZuRmx",
"colab_type": "code",
"colab": {}
},
"source": [
"# Import LogisticRegression:\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.linear_model import LogisticRegression\n",
"from warnings import filterwarnings\n",
"filterwarnings('ignore')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "1s6LIitcuRm1",
"colab_type": "code",
"colab": {}
},
"source": [
"# Set up matrices X and y to store the predictors and response variables, respectively:\n",
"X = df.iloc[:, 0:8]\n",
"y = df.iloc[:, 8]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "6BTzToKluRm4",
"colab_type": "code",
"colab": {}
},
"source": [
"# Build Model\n",
"def clf_model(model):\n",
" clf =model\n",
" scores = cross_val_score(clf, X, y)\n",
" print('Scores:', scores)\n",
" print('Mean Score: ',scores.mean())"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "aa5q4qRPuRm7",
"colab_type": "code",
"colab": {},
"outputId": "1e0143c8-0c78-43fa-cd61-70c77e3f0666"
},
"source": [
"clf_model(LogisticRegression())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.97486034 0.97877095 0.98127969 0.9779268 0.9782062 ]\n",
"Mean Score: 0.9782087940047546\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oYq4ZPh8uRm9",
"colab_type": "text"
},
"source": [
"Logistic regression is very different than linear regression. Logistic regression uses the sigmoid function to classify all instances into one group or the other. Generally speaking, all cases that are above 0.5 are classified as a 1, and all cases that fall below 0.5 are classified as a 0, with decimals that are close to 1 more likely to be a 1, and decimals that are close to 0 more likely to be a 0. Linear regression, by contrast, finds a straight line that minimizes the error between the straight line and the individual points. Logistic regression classifies all points into two groups; all new points will fall into one of these groups. Linear regression finds a line of best fit; all new points may fall anywhere on the line and take on any value."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rzQ_9oKiuRm9",
"colab_type": "text"
},
"source": [
"# Naive Bayes\n",
"\n",
"Naive Bayes is a model based on Bayes' theorem, a famous probability theorem based on a conditional probability that assumes independent events. Similarly, Naive Bayes assumes independent attributes or columns."
]
},
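{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, Bayes' theorem gives the probability of a class $y$ given the observed features $x_1, \\dots, x_n$:\n",
"\n",
"$$P(y \\mid x_1, \\dots, x_n) = \\frac{P(y)\\,P(x_1, \\dots, x_n \\mid y)}{P(x_1, \\dots, x_n)}$$\n",
"\n",
"The naive independence assumption lets the likelihood factor into a product over the columns: $P(x_1, \\dots, x_n \\mid y) = \\prod_{i=1}^{n} P(x_i \\mid y)$."
]
},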
{
"cell_type": "markdown",
"metadata": {
"id": "fnmahcPwuRm_",
"colab_type": "text"
},
"source": [
"### Using GaussianNB, KneighborsClassifier, DecisionTreeClassifier, and RandomForestClassifier to Predict Accuracy in Our Dataset\n",
"\n",
"The goal of this exercise is to predict pulsars using a variety of classifiers, including GaussianNB, KneighborsClassifier, DecisionTreeClassifier, and RandomForestClassifier."
]
},
{
"cell_type": "code",
"metadata": {
"id": "8N9X7_dhuRm_",
"colab_type": "code",
"colab": {},
"outputId": "f6cd7159-3861-4296-8df1-8b51fbb83d11"
},
"source": [
"# GuassianNB\n",
"from sklearn.naive_bayes import GaussianNB\n",
"clf_model(GaussianNB())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.96061453 0.92374302 0.94272143 0.92847164 0.96451523]\n",
"Mean Score: 0.9440131680613636\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "rDLaaou3uRnC",
"colab_type": "code",
"colab": {},
"outputId": "4b6b0682-740e-41be-b92a-079f5170e472"
},
"source": [
"# kNeighborsClassifier\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"clf_model(KNeighborsClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.96955307 0.96927374 0.97317687 0.9706622 0.97289746]\n",
"Mean Score: 0.9711126668446134\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Sg8UhDj-uRnI",
"colab_type": "code",
"colab": {},
"outputId": "e9e8dfe1-c7f5-44fa-a0fc-d5ef3c45c1da"
},
"source": [
"# Decision Tree Clf\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"clf_model(DecisionTreeClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.96703911 0.96424581 0.96842693 0.963677 0.96842693]\n",
"Mean Score: 0.966363158149416\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "jrB1Kn7juRnL",
"colab_type": "code",
"colab": {},
"outputId": "348a5e05-72a2-4f75-a229-f8e4d5cb729a"
},
"source": [
"# Random Forest Clf\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"clf_model(RandomForestClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.97765363 0.98184358 0.97988265 0.97541213 0.97736798]\n",
"Mean Score: 0.9784319923326793\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XiCewXpJuRnN",
"colab_type": "text"
},
"source": [
"#### All classifiers have achieved between 94% and 98% accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VMeGC6mQuRnN",
"colab_type": "text"
},
"source": [
"You may also wonder how to know when to use these classifiers. The bottom line is that whenever you have a classification problem, meaning that the data has a target column with a finite number of options, such as three kinds of wine, all classifiers are worth trying. Naive Bayes is known to work well with text data, and random forests are known to work well generally."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ACUtocv3uRnO",
"colab_type": "text"
},
"source": [
"# Confusion Matrix and Classification Report for the Pulsar Dataset"
]
},
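{
"cell_type": "markdown",
"metadata": {},
"source": [
"A confusion matrix breaks predictions down by class. For a binary target, sklearn's confusion_matrix lays it out with actual classes as rows and predicted classes as columns, so the diagonal holds the correct predictions:\n",
"\n",
"| | Predicted 0 | Predicted 1 |\n",
"| --- | --- | --- |\n",
"| **Actual 0** | true negatives | false positives |\n",
"| **Actual 1** | false negatives | true positives |\n",
"\n",
"In the classification report, precision is TP / (TP + FP), recall is TP / (TP + FN), and the f1-score is their harmonic mean."
]
},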
{
"cell_type": "code",
"metadata": {
"id": "DQ5pQQcbuRnP",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.metrics import classification_report\n",
"from sklearn.metrics import confusion_matrix"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "OZSujmWpuRnS",
"colab_type": "code",
"colab": {}
},
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "mngvEembuRnW",
"colab_type": "code",
"colab": {}
},
"source": [
"def confusion(model):\n",
" clf = model\n",
" clf.fit(X_train, y_train)\n",
" y_pred = clf.predict(X_test)\n",
" print('Confusion Matrix: ', confusion_matrix(y_test, y_pred))\n",
" print('Classification Report:', classification_report(y_test, y_pred))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "_NIaHsEquRnZ",
"colab_type": "code",
"colab": {},
"outputId": "5e976300-59df-4c6c-e00c-6f576f2c7f70"
},
"source": [
"confusion(LogisticRegression())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[4030 24]\n",
" [ 77 344]]\n",
"Classification Report: precision recall f1-score support\n",
"\n",
" 0 0.98 0.99 0.99 4054\n",
" 1 0.93 0.82 0.87 421\n",
"\n",
" accuracy 0.98 4475\n",
" macro avg 0.96 0.91 0.93 4475\n",
"weighted avg 0.98 0.98 0.98 4475\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "4WJRVATXuRnb",
"colab_type": "code",
"colab": {},
"outputId": "3aeeb820-fa9d-4f00-9995-037e04098086"
},
"source": [
"confusion(KNeighborsClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[4018 36]\n",
" [ 88 333]]\n",
"Classification Report: precision recall f1-score support\n",
"\n",
" 0 0.98 0.99 0.98 4054\n",
" 1 0.90 0.79 0.84 421\n",
"\n",
" accuracy 0.97 4475\n",
" macro avg 0.94 0.89 0.91 4475\n",
"weighted avg 0.97 0.97 0.97 4475\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "UTp4W7eAuRnd",
"colab_type": "code",
"colab": {},
"outputId": "f3429b9c-98d6-416c-e8c7-6f8fae0a9557"
},
"source": [
"confusion(GaussianNB())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[3883 171]\n",
" [ 64 357]]\n",
"Classification Report: precision recall f1-score support\n",
"\n",
" 0 0.98 0.96 0.97 4054\n",
" 1 0.68 0.85 0.75 421\n",
"\n",
" accuracy 0.95 4475\n",
" macro avg 0.83 0.90 0.86 4475\n",
"weighted avg 0.95 0.95 0.95 4475\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "1Sk0XvaPuRnf",
"colab_type": "code",
"colab": {},
"outputId": "9f8fc21e-4896-40c5-b550-add8d3e5c1b6"
},
"source": [
"confusion(RandomForestClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[4028 26]\n",
" [ 71 350]]\n",
"Classification Report: precision recall f1-score support\n",
"\n",
" 0 0.98 0.99 0.99 4054\n",
" 1 0.93 0.83 0.88 421\n",
"\n",
" accuracy 0.98 4475\n",
" macro avg 0.96 0.91 0.93 4475\n",
"weighted avg 0.98 0.98 0.98 4475\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o2bbrIY8uRng",
"colab_type": "text"
},
"source": [
"# Boosting Methods\n",
"\n",
"Random Forests are a type of bagging method. A bagging method is a machine learning method that aggregates a large sum of machine learning models. In the case of Random Forests, the aggregates are decision trees.\n",
"\n",
"Another machine learning method is boosting. The idea behind boosting is to transform a weak learner into a strong learner by modifying the weights for the rows that the learner got wrong. A weak learner may have an error of 49%, hardly better than a coin flip. A strong learner, by contrast, may have an error rate of 1 or 2 %. With enough iterations, very weak learners can be transformed into very strong learners."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cxKoul2euRnh",
"colab_type": "text"
},
"source": [
"### Using AdaBoost to Predict the Best Optimal Values"
]
},
{
"cell_type": "code",
"metadata": {
"id": "SMdAX7bZuRnh",
"colab_type": "code",
"colab": {},
"outputId": "d8a979da-c287-478f-9133-83f6231df8b8"
},
"source": [
"from sklearn.ensemble import AdaBoostClassifier\n",
"clf_model(AdaBoostClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.97430168 0.97988827 0.98127969 0.97597094 0.97708857]\n",
"Mean Score: 0.9777058290056365\n"
],
"name": "stdout"
}
]
},
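{
"cell_type": "markdown",
"metadata": {},
"source": [
"AdaBoost's main hyperparameters are n_estimators (the number of boosting iterations) and learning_rate (how strongly each iteration's reweighting counts). As a minimal sketch with arbitrary, untuned values:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Arbitrary example values; smaller learning rates usually need more estimators\n",
"for lr in [0.1, 0.5, 1.0]:\n",
"    scores = cross_val_score(AdaBoostClassifier(n_estimators=100, learning_rate=lr), X, y)\n",
"    print('learning_rate = {}: mean score = {:.4f}'.format(lr, scores.mean()))"
],
"execution_count": 0,
"outputs": []
},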
{
"cell_type": "code",
"metadata": {
"id": "UIMr52jOuRnj",
"colab_type": "code",
"colab": {},
"outputId": "000decad-6367-4db5-9ab2-4af9c0da5f69"
},
"source": [
"confusion(AdaBoostClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[4025 29]\n",
" [ 83 338]]\n",
"Classification Report: precision recall f1-score support\n",
"\n",
" 0 0.98 0.99 0.99 4054\n",
" 1 0.92 0.80 0.86 421\n",
"\n",
" accuracy 0.97 4475\n",
" macro avg 0.95 0.90 0.92 4475\n",
"weighted avg 0.97 0.97 0.97 4475\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QRoYIq9YuRnl",
"colab_type": "text"
},
"source": [
"Totals of 98% for precision, recall, and the f1-score are outstanding. The f1-score of the positive pulsar classification, the 1's, is 86%, nearly performing as well as RandomForestClassifier."
]
},
{
"cell_type": "code",
"metadata": {
"id": "tdFZJXxquRnl",
"colab_type": "code",
"colab": {}
},
"source": [
"X = housing_df.iloc[:,:-1]\n",
"y = housing_df.iloc[:, -1]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "cUf_zeekuRnn",
"colab_type": "code",
"colab": {},
"outputId": "cf2c0735-1cae-4ff2-cd75-3b3d3daad705"
},
"source": [
"from sklearn.ensemble import AdaBoostRegressor\n",
"regression_model_cv(AdaBoostRegressor())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Reg rmse: [3.46011122 3.23658592 5.94132343 6.33384407 4.69895292]\n",
"Reg mean: 4.734163512741491\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dzc-__LuuRnr",
"colab_type": "text"
},
"source": [
"# Using AdaBoost to Predict the Best Optimal Values"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J7ZCVzE1uRnr",
"colab_type": "text"
},
"source": [
"In this activity, you will use machine learning to solve a real-world problem. A bank wants to predict whether customers will return, also known as churn. They want to know which customers are most likely to leave. They give you their data, and they ask you to create a machine-learning algorithm to help them target the customers most likely to leave.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "bycFtb76uRns",
"colab_type": "code",
"colab": {},
"outputId": "8be2efb1-fd2f-4c73-dcbb-b50a0cb8a92f"
},
"source": [
"df = pd.read_csv('CHURN.csv')\n",
"df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>customerID</th>\n",
" <th>gender</th>\n",
" <th>SeniorCitizen</th>\n",
" <th>Partner</th>\n",
" <th>Dependents</th>\n",
" <th>tenure</th>\n",
" <th>PhoneService</th>\n",
" <th>MultipleLines</th>\n",
" <th>InternetService</th>\n",
" <th>OnlineSecurity</th>\n",
" <th>...</th>\n",
" <th>DeviceProtection</th>\n",
" <th>TechSupport</th>\n",
" <th>StreamingTV</th>\n",
" <th>StreamingMovies</th>\n",
" <th>Contract</th>\n",
" <th>PaperlessBilling</th>\n",
" <th>PaymentMethod</th>\n",
" <th>MonthlyCharges</th>\n",
" <th>TotalCharges</th>\n",
" <th>Churn</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7590-VHVEG</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>1</td>\n",
" <td>No</td>\n",
" <td>No phone service</td>\n",
" <td>DSL</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Electronic check</td>\n",
" <td>29.85</td>\n",
" <td>29.85</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5575-GNVDE</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>34</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>One year</td>\n",
" <td>No</td>\n",
" <td>Mailed check</td>\n",
" <td>56.95</td>\n",
" <td>1889.5</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3668-QPYBK</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Mailed check</td>\n",
" <td>53.85</td>\n",
" <td>108.15</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>7795-CFOCW</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>45</td>\n",
" <td>No</td>\n",
" <td>No phone service</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>One year</td>\n",
" <td>No</td>\n",
" <td>Bank transfer (automatic)</td>\n",
" <td>42.30</td>\n",
" <td>1840.75</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9237-HQITU</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Fiber optic</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Electronic check</td>\n",
" <td>70.70</td>\n",
" <td>151.65</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" customerID gender SeniorCitizen Partner Dependents tenure PhoneService \\\n",
"0 7590-VHVEG Female 0 Yes No 1 No \n",
"1 5575-GNVDE Male 0 No No 34 Yes \n",
"2 3668-QPYBK Male 0 No No 2 Yes \n",
"3 7795-CFOCW Male 0 No No 45 No \n",
"4 9237-HQITU Female 0 No No 2 Yes \n",
"\n",
" MultipleLines InternetService OnlineSecurity ... DeviceProtection \\\n",
"0 No phone service DSL No ... No \n",
"1 No DSL Yes ... Yes \n",
"2 No DSL Yes ... No \n",
"3 No phone service DSL Yes ... Yes \n",
"4 No Fiber optic No ... No \n",
"\n",
" TechSupport StreamingTV StreamingMovies Contract PaperlessBilling \\\n",
"0 No No No Month-to-month Yes \n",
"1 No No No One year No \n",
"2 No No No Month-to-month Yes \n",
"3 Yes No No One year No \n",
"4 No No No Month-to-month Yes \n",
"\n",
" PaymentMethod MonthlyCharges TotalCharges Churn \n",
"0 Electronic check 29.85 29.85 No \n",
"1 Mailed check 56.95 1889.5 No \n",
"2 Mailed check 53.85 108.15 Yes \n",
"3 Bank transfer (automatic) 42.30 1840.75 No \n",
"4 Electronic check 70.70 151.65 Yes \n",
"\n",
"[5 rows x 21 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 68
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "3HRBmZYEuRnu",
"colab_type": "code",
"colab": {},
"outputId": "8f469c75-0c8f-43b5-ae40-fbc43db0dfef"
},
"source": [
"df.describe()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SeniorCitizen</th>\n",
" <th>tenure</th>\n",
" <th>MonthlyCharges</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>7043.000000</td>\n",
" <td>7043.000000</td>\n",
" <td>7043.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>0.162147</td>\n",
" <td>32.371149</td>\n",
" <td>64.761692</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.368612</td>\n",
" <td>24.559481</td>\n",
" <td>30.090047</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>18.250000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.000000</td>\n",
" <td>9.000000</td>\n",
" <td>35.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>0.000000</td>\n",
" <td>29.000000</td>\n",
" <td>70.350000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>0.000000</td>\n",
" <td>55.000000</td>\n",
" <td>89.850000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1.000000</td>\n",
" <td>72.000000</td>\n",
" <td>118.750000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SeniorCitizen tenure MonthlyCharges\n",
"count 7043.000000 7043.000000 7043.000000\n",
"mean 0.162147 32.371149 64.761692\n",
"std 0.368612 24.559481 30.090047\n",
"min 0.000000 0.000000 18.250000\n",
"25% 0.000000 9.000000 35.500000\n",
"50% 0.000000 29.000000 70.350000\n",
"75% 0.000000 55.000000 89.850000\n",
"max 1.000000 72.000000 118.750000"
]
},
"metadata": {
"tags": []
},
"execution_count": 70
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "2CM8tz3auRnx",
"colab_type": "code",
"colab": {},
"outputId": "8e16cc0e-e76d-4892-b899-0a2504ded4f0"
},
"source": [
"df.info()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 7043 entries, 0 to 7042\n",
"Data columns (total 21 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 customerID 7043 non-null object \n",
" 1 gender 7043 non-null object \n",
" 2 SeniorCitizen 7043 non-null int64 \n",
" 3 Partner 7043 non-null object \n",
" 4 Dependents 7043 non-null object \n",
" 5 tenure 7043 non-null int64 \n",
" 6 PhoneService 7043 non-null object \n",
" 7 MultipleLines 7043 non-null object \n",
" 8 InternetService 7043 non-null object \n",
" 9 OnlineSecurity 7043 non-null object \n",
" 10 OnlineBackup 7043 non-null object \n",
" 11 DeviceProtection 7043 non-null object \n",
" 12 TechSupport 7043 non-null object \n",
" 13 StreamingTV 7043 non-null object \n",
" 14 StreamingMovies 7043 non-null object \n",
" 15 Contract 7043 non-null object \n",
" 16 PaperlessBilling 7043 non-null object \n",
" 17 PaymentMethod 7043 non-null object \n",
" 18 MonthlyCharges 7043 non-null float64\n",
" 19 TotalCharges 7043 non-null object \n",
" 20 Churn 7043 non-null object \n",
"dtypes: float64(1), int64(2), object(18)\n",
"memory usage: 1.1+ MB\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "dsrQ6_QmuRn2",
"colab_type": "code",
"colab": {},
"outputId": "f02f8018-29a2-43be-e9b1-70c6656cd8e6"
},
"source": [
"df.isna().any()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"customerID False\n",
"gender False\n",
"SeniorCitizen False\n",
"Partner False\n",
"Dependents False\n",
"tenure False\n",
"PhoneService False\n",
"MultipleLines False\n",
"InternetService False\n",
"OnlineSecurity False\n",
"OnlineBackup False\n",
"DeviceProtection False\n",
"TechSupport False\n",
"StreamingTV False\n",
"StreamingMovies False\n",
"Contract False\n",
"PaperlessBilling False\n",
"PaymentMethod False\n",
"MonthlyCharges False\n",
"TotalCharges False\n",
"Churn False\n",
"dtype: bool"
]
},
"metadata": {
"tags": []
},
"execution_count": 69
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b7J9ExDYuRn7",
"colab_type": "text"
},
"source": [
"There are no NAN values"
]
},
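{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `TotalCharges` is stored as text, `isna()` cannot catch entries that are merely blank strings. Coercing the column to numeric is a quick extra check for such hidden missing values (our own addition, not part of the original activity)."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Coerce TotalCharges to numeric; any non-numeric entries become NaN\n",
"pd.to_numeric(df['TotalCharges'], errors='coerce').isna().sum()"
],
"execution_count": 0,
"outputs": []
},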
{
"cell_type": "code",
"metadata": {
"id": "k6PvC_xJuRn7",
"colab_type": "code",
"colab": {}
},
"source": [
"df['Churn'] = df['Churn'].replace(to_replace=['No', 'Yes'], value=[0, 1])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "stKvXmazuRn9",
"colab_type": "code",
"colab": {},
"outputId": "ed541088-86e8-4756-eac1-a2cd56721574"
},
"source": [
"df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>customerID</th>\n",
" <th>gender</th>\n",
" <th>SeniorCitizen</th>\n",
" <th>Partner</th>\n",
" <th>Dependents</th>\n",
" <th>tenure</th>\n",
" <th>PhoneService</th>\n",
" <th>MultipleLines</th>\n",
" <th>InternetService</th>\n",
" <th>OnlineSecurity</th>\n",
" <th>...</th>\n",
" <th>DeviceProtection</th>\n",
" <th>TechSupport</th>\n",
" <th>StreamingTV</th>\n",
" <th>StreamingMovies</th>\n",
" <th>Contract</th>\n",
" <th>PaperlessBilling</th>\n",
" <th>PaymentMethod</th>\n",
" <th>MonthlyCharges</th>\n",
" <th>TotalCharges</th>\n",
" <th>Churn</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7590-VHVEG</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>1</td>\n",
" <td>No</td>\n",
" <td>No phone service</td>\n",
" <td>DSL</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Electronic check</td>\n",
" <td>29.85</td>\n",
" <td>29.85</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5575-GNVDE</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>34</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>One year</td>\n",
" <td>No</td>\n",
" <td>Mailed check</td>\n",
" <td>56.95</td>\n",
" <td>1889.5</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3668-QPYBK</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Mailed check</td>\n",
" <td>53.85</td>\n",
" <td>108.15</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>7795-CFOCW</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>45</td>\n",
" <td>No</td>\n",
" <td>No phone service</td>\n",
" <td>DSL</td>\n",
" <td>Yes</td>\n",
" <td>...</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>One year</td>\n",
" <td>No</td>\n",
" <td>Bank transfer (automatic)</td>\n",
" <td>42.30</td>\n",
" <td>1840.75</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9237-HQITU</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Fiber optic</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Month-to-month</td>\n",
" <td>Yes</td>\n",
" <td>Electronic check</td>\n",
" <td>70.70</td>\n",
" <td>151.65</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" customerID gender SeniorCitizen Partner Dependents tenure PhoneService \\\n",
"0 7590-VHVEG Female 0 Yes No 1 No \n",
"1 5575-GNVDE Male 0 No No 34 Yes \n",
"2 3668-QPYBK Male 0 No No 2 Yes \n",
"3 7795-CFOCW Male 0 No No 45 No \n",
"4 9237-HQITU Female 0 No No 2 Yes \n",
"\n",
" MultipleLines InternetService OnlineSecurity ... DeviceProtection \\\n",
"0 No phone service DSL No ... No \n",
"1 No DSL Yes ... Yes \n",
"2 No DSL Yes ... No \n",
"3 No phone service DSL Yes ... Yes \n",
"4 No Fiber optic No ... No \n",
"\n",
" TechSupport StreamingTV StreamingMovies Contract PaperlessBilling \\\n",
"0 No No No Month-to-month Yes \n",
"1 No No No One year No \n",
"2 No No No Month-to-month Yes \n",
"3 Yes No No One year No \n",
"4 No No No Month-to-month Yes \n",
"\n",
" PaymentMethod MonthlyCharges TotalCharges Churn \n",
"0 Electronic check 29.85 29.85 0 \n",
"1 Mailed check 56.95 1889.5 0 \n",
"2 Mailed check 53.85 108.15 1 \n",
"3 Bank transfer (automatic) 42.30 1840.75 0 \n",
"4 Electronic check 70.70 151.65 1 \n",
"\n",
"[5 rows x 21 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 73
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "5rpjRmfhuRn_",
"colab_type": "code",
"colab": {}
},
"source": [
"X = df.iloc[:,1:-1]\n",
"y = df.iloc[:, -1]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "_MpodgAluRoB",
"colab_type": "code",
"colab": {},
"outputId": "8e414bec-50e1-435e-85c6-56a50505c5fc"
},
"source": [
"X.shape, y.shape"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"((7043, 19), (7043,))"
]
},
"metadata": {
"tags": []
},
"execution_count": 78
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "gmqgc4jFuRoD",
"colab_type": "code",
"colab": {},
"outputId": "57c50fc9-0f03-4fca-cb76-cc4987545e38"
},
"source": [
"X = pd.get_dummies(X)\n",
"X.shape"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(7043, 6575)"
]
},
"metadata": {
"tags": []
},
"execution_count": 79
}
]
},
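{
"cell_type": "markdown",
"metadata": {},
"source": [
"6,575 columns is far more than 19 mostly categorical features would normally produce. The culprit is `TotalCharges`: because it is stored as text, `get_dummies` creates one column per unique charge value. A leaner alternative (our own suggestion, not the original workflow) converts it to numeric before encoding, as sketched below."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch: convert TotalCharges to numeric before one-hot encoding\n",
"X_alt = df.iloc[:, 1:-1].copy()\n",
"X_alt['TotalCharges'] = pd.to_numeric(X_alt['TotalCharges'], errors='coerce')\n",
"X_alt['TotalCharges'] = X_alt['TotalCharges'].fillna(X_alt['TotalCharges'].median())\n",
"X_alt = pd.get_dummies(X_alt)\n",
"X_alt.shape  # far fewer columns than 6575"
],
"execution_count": 0,
"outputs": []
},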
{
"cell_type": "code",
"metadata": {
"id": "t0YEpEbAuRoF",
"colab_type": "code",
"colab": {}
},
"source": [
"def clf_model_cv(model):\n",
" clf =model\n",
" scores = cross_val_score(clf, X, y)\n",
" print('Scores:', scores)\n",
" print('Mean Score: ',scores.mean())"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Ke5Do2qRuRoH",
"colab_type": "code",
"colab": {},
"outputId": "73ded69a-90a9-4ea9-f6f8-4b12d9262137"
},
"source": [
"clf_model_cv(LogisticRegression())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Scores: [0.80979418 0.81547197 0.78921221 0.80823864 0.79971591]\n",
"Mean Score: 0.8044865797793405\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "ULDvwqTPuRoL",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.metrics import classification_report\n",
"from sklearn.metrics import confusion_matrix\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test ,y_train, y_test = train_test_split(X, y, test_size = 0.25)\n",
"def confusion(model):\n",
" clf = model\n",
" clf.fit(X_train, y_train)\n",
" y_pred = clf.predict(X_test)\n",
" print('Confusion Matrix:', confusion_matrix(y_test, y_pred))\n",
" print('Classfication Report:', classification_report(y_test, y_pred))\n",
" \n",
" return clf"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "4DHGxG-fuRoN",
"colab_type": "code",
"colab": {},
"outputId": "7db12be3-7c96-4a65-d5ad-359b57eb5935"
},
"source": [
"confusion(AdaBoostClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[1143 131]\n",
" [ 220 267]]\n",
"Classfication Report: precision recall f1-score support\n",
"\n",
" 0 0.84 0.90 0.87 1274\n",
" 1 0.67 0.55 0.60 487\n",
"\n",
" accuracy 0.80 1761\n",
" macro avg 0.75 0.72 0.74 1761\n",
"weighted avg 0.79 0.80 0.79 1761\n",
"\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,\n",
" n_estimators=50, random_state=None)"
]
},
"metadata": {
"tags": []
},
"execution_count": 95
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": false,
"id": "RTWE2BaauRoP",
"colab_type": "code",
"colab": {},
"outputId": "f6be2d58-9ac6-4ab3-e789-426e36ffca37"
},
"source": [
"confusion(RandomForestClassifier())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[1174 100]\n",
" [ 255 232]]\n",
"Classfication Report: precision recall f1-score support\n",
"\n",
" 0 0.82 0.92 0.87 1274\n",
" 1 0.70 0.48 0.57 487\n",
"\n",
" accuracy 0.80 1761\n",
" macro avg 0.76 0.70 0.72 1761\n",
"weighted avg 0.79 0.80 0.79 1761\n",
"\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n",
" criterion='gini', max_depth=None, max_features='auto',\n",
" max_leaf_nodes=None, max_samples=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=100,\n",
" n_jobs=None, oob_score=False, random_state=None,\n",
" verbose=0, warm_start=False)"
]
},
"metadata": {
"tags": []
},
"execution_count": 98
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "WGRmWRDLuRoQ",
"colab_type": "code",
"colab": {},
"outputId": "fcae5a4d-4cd8-4830-89d4-07b542941413"
},
"source": [
"# We looked up AdaBoostClassifier() and discovered the n_estimators hyperparameter, similar to the n_estimators of Random Forests. We tried several out and came up with the following result for n_estimators=250:\n",
"confusion(AdaBoostClassifier(n_estimators=250))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Confusion Matrix: [[1151 123]\n",
" [ 225 262]]\n",
"Classfication Report: precision recall f1-score support\n",
"\n",
" 0 0.84 0.90 0.87 1274\n",
" 1 0.68 0.54 0.60 487\n",
"\n",
" accuracy 0.80 1761\n",
" macro avg 0.76 0.72 0.73 1761\n",
"weighted avg 0.79 0.80 0.79 1761\n",
"\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,\n",
" n_estimators=250, random_state=None)"
]
},
"metadata": {
"tags": []
},
"execution_count": 97
}
]
},
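{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trying values by hand works, but a grid search automates it. The sketch below (our own illustration; the grid values are arbitrary) cross-validates several `n_estimators` and `learning_rate` combinations on the training set."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch of a systematic hyperparameter search for AdaBoost\n",
"# (can be slow: the dummy-encoded feature matrix is very wide)\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"param_grid = {\n",
"    'n_estimators': [50, 100, 250],\n",
"    'learning_rate': [0.5, 1.0]\n",
"}\n",
"search = GridSearchCV(AdaBoostClassifier(), param_grid, cv=5)\n",
"search.fit(X_train, y_train)\n",
"print(search.best_params_, search.best_score_)"
],
"execution_count": 0,
"outputs": []
},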
{
"cell_type": "code",
"metadata": {
"id": "q_vmM7xeuRoS",
"colab_type": "code",
"colab": {}
},
"source": [
""
],
"execution_count": 0,
"outputs": []
}
]
}