Dataaspirant-XGBoost-Iris-classification.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Dataaspirant-XGBoost-Iris-classification.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/saimadhu-polamuri/a2c1fd3bd76c77328994af3c8298698a/dataaspirant-xgboost-iris-classification.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "M4qSfmQkMUFL"
},
"source": [
"## Using XGBoost for a Classification Problem: Overview in Python 3.x\n",
"Pipeline: \n",
"1. Import the libraries/modules needed\n",
"2. Import data\n",
"3. Data cleaning and pre-processing\n",
"4. Train-test split\n",
"5. XGBoost training and prediction\n",
"6. Model Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HbjZySLAMsot"
},
"source": [
"### 1. Import the libraries/modules needed"
]
},
{
"cell_type": "code",
"metadata": {
"id": "SCWBNAT5qVgk"
},
"source": [
"# import the libraries needed\n",
"import xgboost as xgb\n",
"import numpy as np\n",
"import pandas as pd\n",
"import sklearn"
],
"execution_count": 1,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3JVGZMJsM4LG"
},
"source": [
"### 2. Import data"
]
},
{
"cell_type": "code",
"metadata": {
"id": "sXDE2yiZqaiv"
},
"source": [
"from sklearn.datasets import load_iris\n",
"iris = load_iris()"
],
"execution_count": 2,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "9oQbiqT0rP4I",
"outputId": "efcb1ec0-1324-40a4-ea25-dfcc19b0fdb0",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"source": [
"data = pd.DataFrame(iris.data)\n",
"data.columns = iris.feature_names\n",
"print(data.sample(10))"
],
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"19 5.1 3.8 1.5 0.3\n",
"86 6.7 3.1 4.7 1.5\n",
"36 5.5 3.5 1.3 0.2\n",
"27 5.2 3.5 1.5 0.2\n",
"83 6.0 2.7 5.1 1.6\n",
"74 6.4 2.9 4.3 1.3\n",
"129 7.2 3.0 5.8 1.6\n",
"68 6.2 2.2 4.5 1.5\n",
"131 7.9 3.8 6.4 2.0\n",
"140 6.7 3.1 5.6 2.4\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pfFXkLomNRMP"
},
"source": [
"### 3. Data pre-processing"
]
},
{
"cell_type": "code",
"metadata": {
"id": "AvyyajbGrZH5"
},
"source": [
"## Extract the data to the required variables\n",
"X = iris.data\n",
"y = iris.target"
],
"execution_count": 4,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "ubDPk4btup1p",
"outputId": "d942befd-f06a-4843-e78e-0aaba3e71bac",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"source": [
"iris.data.shape"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(150, 4)"
]
},
"metadata": {
"tags": []
},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2Vw_tj50NVp3"
},
"source": [
"### 4. Train-test split"
]
},
{
"cell_type": "code",
"metadata": {
"id": "NaAAbL96re8P"
},
"source": [
"## split the data into train and test set. The test size here is 30% of the data\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)"
],
"execution_count": 5,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "d4ATnBCXrfLI"
},
"source": [
"dtrain = xgb.DMatrix(X_train, label=y_train)\n",
"dtest = xgb.DMatrix(X_test, label=y_test)"
],
"execution_count": 6,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "PkfbyllWu2at",
"outputId": "c5498079-611f-4a92-9085-c37b195f200b",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"source": [
"y_test.shape"
],
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(45,)"
]
},
"metadata": {
"tags": []
},
"execution_count": 7
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oMwwxZtSaQis"
},
"source": [
"## Hyperparameter Tuning"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ZchkWNeLsjKS"
},
"source": [
"# set the hyperparameters (fixed values; no search is performed here)\n",
"param = {\n",
" 'max_depth': 3, # the maximum depth of each tree\n",
" 'eta': 0.3, # learning rate (step-size shrinkage applied at each boosting round)\n",
" 'verbosity': 0, # logging mode - quiet (replaces the deprecated 'silent' flag)\n",
" 'objective': 'multi:softprob', # multiclass objective that outputs per-class probabilities\n",
" 'num_class': 3} # the number of classes in this dataset\n",
"num_round = 20 # the number of boosting rounds (training iterations)"
],
"execution_count": 8,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "ePU05kkxMeCV"
},
"source": [
"### 5. XGBoost training and prediction"
]
},
{
"cell_type": "code",
"metadata": {
"id": "lMjkxUAns9ha"
},
"source": [
"bst = xgb.train(param, dtrain, num_round)"
],
"execution_count": 9,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Qbz51HPxs9k0",
"outputId": "736c044b-56f9-435d-ccb9-ec46d2c51bdf",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"source": [
"preds = bst.predict(dtest)\n",
"preds.shape"
],
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(45, 3)"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "mxHFARIssju3",
"outputId": "aa09a27e-6584-4b94-c4c3-50a2b54686e0",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"source": [
"import numpy as np\n",
"# pick the most probable class for each row of the (n_samples, num_class) probability matrix\n",
"best_preds = np.argmax(preds, axis=1)\n",
"print(best_preds)"
],
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": [
"[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1\n",
" 0 0 0 2 1 1 0 0]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "YMztb0avMeZy",
"outputId": "4bc71cad-5eb8-42cb-af0d-ead9115aaa23",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"source": [
"best_preds.shape"
],
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(45,)"
]
},
"metadata": {
"tags": []
},
"execution_count": 12
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oQfMfCbyMhnW"
},
"source": [
"### 6. Model Evaluation"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Ey7UNw30MUZI",
"outputId": "bd6dc269-e49f-4a8f-9039-635d1f8e09ad",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"source": [
"from sklearn.metrics import precision_score, f1_score\n",
"print(precision_score(y_test, best_preds, average='macro'))\n",
"print(f1_score(y_test, best_preds, average='weighted'))\n"
],
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"text": [
"1.0\n",
"1.0\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "BoZY_M9qaYdf"
},
"source": [
""
],
"execution_count": null,
"outputs": []
}
]
}
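
The notebook turns the `(45, 3)` probability matrix from `multi:softprob` into class labels with an argmax, then scores them with `precision_score(..., average='macro')`. The sketch below walks through both steps on a small hand-made example, using only NumPy. The probability values, the labels, and the helper name `macro_precision` are illustrative assumptions, not taken from the notebook's output. The helper mirrors what macro averaging is understood to compute (an unweighted mean of per-class precision) and is not scikit-learn's implementation.

```python
import numpy as np

# Made-up class probabilities for 4 samples and 3 classes, shaped like the
# (n_samples, num_class) array that a 'multi:softprob' model returns.
probs = np.array([
    [0.90, 0.07, 0.03],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
    [0.20, 0.30, 0.50],
])
y_true = np.array([0, 1, 2, 1])  # hypothetical ground-truth labels

# The predicted label for each row is the column holding the largest probability.
y_pred = np.argmax(probs, axis=1)


def macro_precision(y_true, y_pred, num_class):
    """Unweighted mean of per-class precision (hypothetical helper)."""
    per_class = []
    for c in range(num_class):
        predicted_c = y_pred == c
        if predicted_c.sum() == 0:
            # No sample was assigned to this class; count its precision as 0.
            per_class.append(0.0)
        else:
            # Fraction of samples predicted as class c that truly are class c.
            per_class.append(float((y_true[predicted_c] == c).mean()))
    return float(np.mean(per_class))


print(y_pred)
print(macro_precision(y_true, y_pred, 3))
```

Here the last sample is misclassified as class 2, so class 2 has a precision of 0.5 while the other two classes stay at 1.0. Because macro averaging weights every class equally, one weak class lowers the score no matter how many samples it has.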