saimadhu-polamuri/dataaspirant-xgboost-boston-housing-price-prediction.ipynb

## dataaspirant-xgboost-boston-housing-price-prediction.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Dataaspirant-XGBoost-Boston-Housing-Price-Prediction.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/saimadhu-polamuri/92f91ad5b7a3931154e236918931f4a7/dataaspirant-xgboost-boston-housing-price-prediction.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kgxp85MHjyLf"
      },
      "source": [
        "XGBoost for Regression Problem Overview in Python 3.x\n",
        "Pipeline: \n",
        "1. Import the libraries/modules needed\n",
        "2. Import data\n",
        "3. Data cleaning and pre-processing\n",
        "4. Train-test split\n",
        "5. XGBoost training and prediction\n",
        "6. Model Evaluation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rZqliePiepxq"
      },
      "source": [
        "## Import the libraries/modules needed"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Lpmc6xllkJ0-"
      },
      "source": [
        "## import the libraries needed\n",
        "import pandas as pd\n",
        "import numpy as np"
      ],
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-ffO3aAHew0L"
      },
      "source": [
        "## Import data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9Wmpq1HqkUoK"
      },
      "source": [
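        "## Import the dataset from the scikit-learn library and assign it to a variable\n",
        "## Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,\n",
        "## so this cell requires an older scikit-learn version\n",
        "from sklearn.datasets import load_boston\n",
        "boston = load_boston()\n",
        "## If you have another practice dataset, import it at this step"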
      ],
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1B3OelIie8I6"
      },
      "source": [
        "## Data cleaning and pre-processing"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jAWUxwmoksce"
      },
      "source": [
        "## store the target values under a 'PRICE' key on the dataset object\n",
        "boston['PRICE'] = boston.target"
      ],
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lwK91yRxm3BG"
      },
      "source": [
        "## assign the features (independent variables) to X and the target to y\n",
        "X = boston.data\n",
        "y = boston['PRICE']"
      ],
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6Y5kwvRUfBnP"
      },
      "source": [
        "## Train-test split"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "r7XiMTcElQQd"
      },
      "source": [
        "## split the data into train and test set. The test size here is 30% of the data\n",
        "from sklearn.model_selection import train_test_split\n",
        "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)"
      ],
      "execution_count": 6,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1wGBnVmBfHGC"
      },
      "source": [
        "## XGBoost training and prediction"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "q7xv1dsYlSk_",
        "outputId": "00314454-805d-46f1-dc6c-5d5c4743b010",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
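        "## import the XGBoost regressor and fit it with its default hyperparameters\n",
        "## Note: with this XGBoost version the default objective is reg:linear, which the\n",
        "## warning in the output reports as deprecated in favor of reg:squarederror\n",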
        "from xgboost import XGBRegressor\n",
        "xgb = XGBRegressor()\n",
        "xgb.fit(X_train, y_train)"
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[15:20:29] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
              "             colsample_bynode=1, colsample_bytree=1, gamma=0,\n",
              "             importance_type='gain', learning_rate=0.1, max_delta_step=0,\n",
              "             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,\n",
              "             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,\n",
              "             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n",
              "             silent=None, subsample=1, verbosity=1)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 7
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0UNg_XTpld-T"
      },
      "source": [
        "## After training the model, predict on the training data\n",
        "y_pred = xgb.predict(X_train)"
      ],
      "execution_count": 8,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8_BcfmP0fOtU"
      },
      "source": [
        "## Model Evaluation"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zNX8iUconT3O",
        "outputId": "6551266a-9f73-46a5-e51c-bfb0ecab8197",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "## import metrics and evaluate the XGBoost model on the training set\n",
        "from sklearn import metrics\n",
        "print('R^2:',metrics.r2_score(y_train, y_pred))\n",
        "print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))\n",
        "print('MAE:',metrics.mean_absolute_error(y_train, y_pred))\n",
        "print('MSE:',metrics.mean_squared_error(y_train, y_pred))\n",
        "print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))"
      ],
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "R^2: 0.9703652512761263\n",
            "Adjusted R^2: 0.9692321579425663\n",
            "MAE: 1.1372202838208043\n",
            "MSE: 2.230632123289034\n",
            "RMSE: 1.4935300878419002\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jox6hGl5opSF"
      },
      "source": [
        "## Apply the model to the test set\n",
        "y_test_pred = xgb.predict(X_test)"
      ],
      "execution_count": 10,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "afxZ1a14opwq",
        "outputId": "6af1e52d-d814-4596-c374-72700cd881a1",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "## Evaluate the performance of the model on the test set\n",
        "## (acc_xgb holds the test R^2 score, not a classification accuracy)\n",
        "acc_xgb = metrics.r2_score(y_test, y_test_pred)\n",
        "print('R^2:', acc_xgb)\n",
        "print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_test_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))\n",
        "print('MAE:',metrics.mean_absolute_error(y_test, y_test_pred))\n",
        "print('MSE:',metrics.mean_squared_error(y_test, y_test_pred))\n",
        "print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))"
      ],
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "R^2: 0.8494894736313225\n",
            "Adjusted R^2: 0.8353109457849979\n",
            "MAE: 2.4509708843733136\n",
            "MSE: 15.716320042597493\n",
            "RMSE: 3.9643814199188117\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YN0OhXatPMIx",
        "outputId": "81b5a04d-bbe7-4655-ffba-ad09f889534e",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "y_test_pred.shape"
      ],
      "execution_count": 12,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(152,)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 12
        }
      ]
    }
  ]
}