saimadhu-polamuri/dataaspirant-xgboost-iris-classification.ipynb

## dataaspirant-xgboost-iris-classification.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Dataaspirant-XGBoost-Iris-classification.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/saimadhu-polamuri/a2c1fd3bd76c77328994af3c8298698a/dataaspirant-xgboost-iris-classification.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M4qSfmQkMUFL"
      },
      "source": [
        "## Using XGBoost for a Classification Problem: Overview in Python 3.x\n",
        "Pipeline: \n",
        "1. Import the libraries/modules needed\n",
        "2. Import data\n",
        "3. Data pre-processing\n",
        "4. Train-test split\n",
        "5. XGBoost training and prediction\n",
        "6. Model Evaluation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HbjZySLAMsot"
      },
      "source": [
        "### 1. Import the libraries/modules needed"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "SCWBNAT5qVgk"
      },
      "source": [
        "# import the libraries needed\n",
        "import xgboost as xgb\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import sklearn"
      ],
      "execution_count": 1,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3JVGZMJsM4LG"
      },
      "source": [
        "### 2. Import data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sXDE2yiZqaiv"
      },
      "source": [
        "from sklearn.datasets import load_iris\n",
        "iris = load_iris()"
      ],
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9oQbiqT0rP4I",
        "outputId": "efcb1ec0-1324-40a4-ea25-dfcc19b0fdb0",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "data = pd.DataFrame(iris.data)\n",
        "data.columns = iris.feature_names\n",
        "print(data.sample(10))"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n",
            "19                 5.1               3.8                1.5               0.3\n",
            "86                 6.7               3.1                4.7               1.5\n",
            "36                 5.5               3.5                1.3               0.2\n",
            "27                 5.2               3.5                1.5               0.2\n",
            "83                 6.0               2.7                5.1               1.6\n",
            "74                 6.4               2.9                4.3               1.3\n",
            "129                7.2               3.0                5.8               1.6\n",
            "68                 6.2               2.2                4.5               1.5\n",
            "131                7.9               3.8                6.4               2.0\n",
            "140                6.7               3.1                5.6               2.4\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pfFXkLomNRMP"
      },
      "source": [
        "### 3. Data pre-processing"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AvyyajbGrZH5"
      },
      "source": [
        "## Extract the data to the required variables\n",
        "X = iris.data\n",
        "y = iris.target"
      ],
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ubDPk4btup1p",
        "outputId": "d942befd-f06a-4843-e78e-0aaba3e71bac",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "iris.data.shape"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(150, 4)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 5
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2Vw_tj50NVp3"
      },
      "source": [
        "### 4. Train-test split"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NaAAbL96re8P"
      },
      "source": [
        "## split the data into train and test set. The test size here is 30% of the data\n",
        "from sklearn.model_selection import train_test_split\n",
        "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)"
      ],
      "execution_count": 5,
      "outputs": []
    },
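    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A quick sanity check on the split above: a minimal sketch that counts how many samples of each class landed in the train and test sets. It reuses `X`, `y`, `y_train` and `y_test` from the cells above. The `stratify=y` variant at the end is an optional assumption, not part of the original pipeline, and it writes to the hypothetical names `y_train_strat` and `y_test_strat` so nothing above is overwritten."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import numpy as np\n",
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "# samples per class (0, 1, 2) in each split\n",
        "print('train:', np.bincount(y_train))\n",
        "print('test: ', np.bincount(y_test))\n",
        "\n",
        "# optional (assumption): a stratified split keeps class proportions equal in both sets\n",
        "_, _, y_train_strat, y_test_strat = train_test_split(\n",
        "    X, y, test_size=0.3, random_state=42, stratify=y)\n",
        "print('stratified test:', np.bincount(y_test_strat))"
      ],
      "execution_count": null,
      "outputs": []
    },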
    {
      "cell_type": "code",
      "metadata": {
        "id": "d4ATnBCXrfLI"
      },
      "source": [
        "dtrain = xgb.DMatrix(X_train, label=y_train)\n",
        "dtest = xgb.DMatrix(X_test, label=y_test)"
      ],
      "execution_count": 6,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "PkfbyllWu2at",
        "outputId": "c5498079-611f-4a92-9085-c37b195f200b",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "y_test.shape"
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(45,)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 7
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oMwwxZtSaQis"
      },
      "source": [
        "## Hyperparameter Tuning"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZchkWNeLsjKS"
      },
      "source": [
        "# hyperparameter settings (fixed by hand; no search is run here)\n",
        "param = {\n",
        "    'max_depth': 3,  # the maximum depth of each tree\n",
        "    'eta': 0.3,  # the learning rate (step-size shrinkage applied to each new tree)\n",
        "    'verbosity': 0,  # logging mode - silent (replaces the deprecated 'silent' parameter)\n",
        "    'objective': 'multi:softprob',  # multiclass objective that outputs per-class probabilities\n",
        "    'num_class': 3}  # the number of classes in this dataset\n",
        "num_round = 20  # the number of boosting rounds (training iterations)"
      ],
      "execution_count": 8,
      "outputs": []
    },
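    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The values above are set by hand. As a minimal sketch of one way to tune a single setting, the cell below uses `xgb.cv` with early stopping to estimate a reasonable number of boosting rounds. It reuses `param` and `dtrain` from the cells above. The 5 folds, the upper limit of 100 rounds, the patience of 10 rounds and the names `cv_results` and `best_num_round` are illustrative assumptions, not values from the original notebook."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import xgboost as xgb\n",
        "\n",
        "# 5-fold stratified cross-validation on the training DMatrix (assumed settings)\n",
        "cv_results = xgb.cv(\n",
        "    param,\n",
        "    dtrain,\n",
        "    num_boost_round=100,       # upper limit on boosting rounds\n",
        "    nfold=5,\n",
        "    stratified=True,\n",
        "    metrics='mlogloss',        # multiclass log loss\n",
        "    early_stopping_rounds=10,  # stop once the test metric stops improving\n",
        "    seed=42,\n",
        ")\n",
        "\n",
        "# one row per boosting round kept; with early stopping the frame ends at the best round\n",
        "best_num_round = len(cv_results)\n",
        "print(best_num_round)\n",
        "print(cv_results[['train-mlogloss-mean', 'test-mlogloss-mean']].tail())"
      ],
      "execution_count": null,
      "outputs": []
    },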
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ePU05kkxMeCV"
      },
      "source": [
        "### 5. XGBoost training and prediction"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lMjkxUAns9ha"
      },
      "source": [
        "bst = xgb.train(param, dtrain, num_round)"
      ],
      "execution_count": 9,
      "outputs": []
    },
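    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see which measurements the trained booster relies on, the sketch below prints gain-based feature importances. It assumes the `bst` and `iris` objects from the cells above. The name `gain` is hypothetical, and features that were never used in a split do not appear in the result."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# gain-based importance per feature; keys look like 'f0'..'f3' because the\n",
        "# DMatrix was built from a plain NumPy array without feature names\n",
        "gain = bst.get_score(importance_type='gain')\n",
        "for key, value in sorted(gain.items(), key=lambda kv: kv[1], reverse=True):\n",
        "    # map 'f<index>' back to the column name from the iris dataset\n",
        "    print(iris.feature_names[int(key[1:])], round(value, 3))"
      ],
      "execution_count": null,
      "outputs": []
    },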
    {
      "cell_type": "code",
      "metadata": {
        "id": "Qbz51HPxs9k0",
        "outputId": "736c044b-56f9-435d-ccb9-ec46d2c51bdf",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "preds = bst.predict(dtest)\n",
        "preds.shape"
      ],
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(45, 3)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 10
        }
      ]
    },
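    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With the `multi:softprob` objective, `predict` returns one row per test sample and one column per class, which is why the shape is `(45, 3)`. The short check below, a sketch that reuses `preds` from the previous cell, confirms that each row behaves like a probability distribution and shows how a row maps to a class label."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import numpy as np\n",
        "\n",
        "# each row holds one probability per class, so rows should sum to about 1\n",
        "print(np.allclose(preds.sum(axis=1), 1.0, atol=1e-5))\n",
        "\n",
        "# first three rows: class probabilities and the most probable class\n",
        "print(preds[:3])\n",
        "print(preds[:3].argmax(axis=1))"
      ],
      "execution_count": null,
      "outputs": []
    },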
    {
      "cell_type": "code",
      "metadata": {
        "id": "mxHFARIssju3",
        "outputId": "aa09a27e-6584-4b94-c4c3-50a2b54686e0",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "# pick the most probable class for each row of the (45, 3) probability matrix\n",
        "best_preds = np.argmax(preds, axis=1)\n",
        "print(best_preds)"
      ],
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1\n",
            " 0 0 0 2 1 1 0 0]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YMztb0avMeZy",
        "outputId": "4bc71cad-5eb8-42cb-af0d-ead9115aaa23",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "best_preds.shape"
      ],
      "execution_count": 12,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(45,)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 12
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oQfMfCbyMhnW"
      },
      "source": [
        "### 6. Model Evaluation"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Ey7UNw30MUZI",
        "outputId": "bd6dc269-e49f-4a8f-9039-635d1f8e09ad",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "from sklearn.metrics import precision_score, f1_score\n",
        "print(precision_score(y_test, best_preds, average='macro'))\n",
        "print(f1_score(y_test, best_preds, average='weighted'))\n"
      ],
      "execution_count": 13,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "1.0\n",
            "1.0\n"
          ],
          "name": "stdout"
        }
      ]
    },
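    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Macro precision and weighted F1 each collapse the result into one number. As a sketch of a fuller view, the cell below adds accuracy, a confusion matrix and a per-class report. It reuses `y_test`, `best_preds` and `iris` from the cells above; these extra metrics are an addition for illustration and were not part of the original evaluation."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "from sklearn.metrics import accuracy_score, confusion_matrix, classification_report\n",
        "\n",
        "# fraction of test samples classified correctly\n",
        "print(accuracy_score(y_test, best_preds))\n",
        "# rows are true classes, columns are predicted classes\n",
        "print(confusion_matrix(y_test, best_preds))\n",
        "# per-class precision, recall and F1 with readable class names\n",
        "print(classification_report(y_test, best_preds, target_names=iris.target_names))"
      ],
      "execution_count": null,
      "outputs": []
    },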
    {
      "cell_type": "code",
      "metadata": {
        "id": "BoZY_M9qaYdf"
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}