Dataaspirant-XGBoost-Iris-classification.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "Dataaspirant-XGBoost-Iris-classification.ipynb", | |
"provenance": [], | |
"collapsed_sections": [], | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/saimadhu-polamuri/a2c1fd3bd76c77328994af3c8298698a/dataaspirant-xgboost-iris-classification.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "M4qSfmQkMUFL" | |
}, | |
"source": [ | |
"## Using XGBoost for a Classification Problem: Overview in Python 3.x\n", | |
"Pipeline: \n", | |
"1. Import the libraries/modules needed\n", | |
"2. Import data\n", | |
"3. Data cleaning and pre-processing\n", | |
"4. Train-test split\n", | |
"5. XGBoost training and prediction\n", | |
"6. Model Evaluation" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "HbjZySLAMsot" | |
}, | |
"source": [ | |
"### 1. Import the libraries/modules needed" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "SCWBNAT5qVgk" | |
}, | |
"source": [ | |
"# import the libraries needed\n", | |
"import xgboost as xgb\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import sklearn" | |
], | |
"execution_count": 1, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "3JVGZMJsM4LG" | |
}, | |
"source": [ | |
"### 2. Import data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "sXDE2yiZqaiv" | |
}, | |
"source": [ | |
"from sklearn.datasets import load_iris\n", | |
"iris = load_iris()" | |
], | |
"execution_count": 2, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "9oQbiqT0rP4I", | |
"outputId": "efcb1ec0-1324-40a4-ea25-dfcc19b0fdb0", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
} | |
}, | |
"source": [ | |
"data = pd.DataFrame(iris.data)\n", | |
"data.columns = iris.feature_names\n", | |
"print(data.sample(10))" | |
], | |
"execution_count": 3, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", | |
"19 5.1 3.8 1.5 0.3\n", | |
"86 6.7 3.1 4.7 1.5\n", | |
"36 5.5 3.5 1.3 0.2\n", | |
"27 5.2 3.5 1.5 0.2\n", | |
"83 6.0 2.7 5.1 1.6\n", | |
"74 6.4 2.9 4.3 1.3\n", | |
"129 7.2 3.0 5.8 1.6\n", | |
"68 6.2 2.2 4.5 1.5\n", | |
"131 7.9 3.8 6.4 2.0\n", | |
"140 6.7 3.1 5.6 2.4\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "pfFXkLomNRMP" | |
}, | |
"source": [ | |
"### 3. Data pre-processing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "AvyyajbGrZH5" | |
}, | |
"source": [ | |
"## Extract the data to the required variables\n", | |
"X = iris.data\n", | |
"y = iris.target" | |
], | |
"execution_count": 4, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ubDPk4btup1p", | |
"outputId": "d942befd-f06a-4843-e78e-0aaba3e71bac", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
} | |
}, | |
"source": [ | |
"iris.data.shape" | |
], | |
"execution_count": null, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"(150, 4)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 5 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "2Vw_tj50NVp3" | |
}, | |
"source": [ | |
"### 4. Train-test split" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "NaAAbL96re8P" | |
}, | |
"source": [ | |
"# Split the data into train and test sets; the test set is 30% of the data\n", | |
"from sklearn.model_selection import train_test_split\n", | |
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)" | |
], | |
"execution_count": 5, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "d4ATnBCXrfLI" | |
}, | |
"source": [ | |
"dtrain = xgb.DMatrix(X_train, label=y_train)\n", | |
"dtest = xgb.DMatrix(X_test, label=y_test)" | |
], | |
"execution_count": 6, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "PkfbyllWu2at", | |
"outputId": "c5498079-611f-4a92-9085-c37b195f200b", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
} | |
}, | |
"source": [ | |
"y_test.shape" | |
], | |
"execution_count": 7, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"(45,)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 7 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "oMwwxZtSaQis" | |
}, | |
"source": [ | |
"## Hyperparameter Tuning" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ZchkWNeLsjKS" | |
}, | |
"source": [ | |
"# set the hyperparameters (fixed values; no search is performed here)\n", | |
"param = {\n", | |
" 'max_depth': 3, # the maximum depth of each tree\n", | |
" 'eta': 0.3, # the learning rate (step-size shrinkage applied at each boosting round)\n", | |
" 'verbosity': 0, # logging mode - quiet (replaces the deprecated 'silent' parameter)\n", | |
" 'objective': 'multi:softprob', # multiclass objective that outputs one probability per class\n", | |
" 'num_class': 3} # the number of classes in this dataset\n", | |
"num_round = 20 # the number of boosting rounds" | |
], | |
"execution_count": 8, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "ePU05kkxMeCV" | |
}, | |
"source": [ | |
"### 5. XGBoost training and prediction" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "lMjkxUAns9ha" | |
}, | |
"source": [ | |
"bst = xgb.train(param, dtrain, num_round)" | |
], | |
"execution_count": 9, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Qbz51HPxs9k0", | |
"outputId": "736c044b-56f9-435d-ccb9-ec46d2c51bdf", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
} | |
}, | |
"source": [ | |
"preds = bst.predict(dtest)\n", | |
"preds.shape" | |
], | |
"execution_count": 10, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"(45, 3)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 10 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "mxHFARIssju3", | |
"outputId": "aa09a27e-6584-4b94-c4c3-50a2b54686e0", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
} | |
}, | |
"source": [ | |
"# pick the class with the highest predicted probability in each row\n", | |
"best_preds = np.argmax(preds, axis=1)\n", | |
"print(best_preds)" | |
], | |
"execution_count": 11, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1\n", | |
" 0 0 0 2 1 1 0 0]\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "YMztb0avMeZy", | |
"outputId": "4bc71cad-5eb8-42cb-af0d-ead9115aaa23", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
} | |
}, | |
"source": [ | |
"best_preds.shape" | |
], | |
"execution_count": 12, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"(45,)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 12 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "oQfMfCbyMhnW" | |
}, | |
"source": [ | |
"### 6. Model Evaluation" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Ey7UNw30MUZI", | |
"outputId": "bd6dc269-e49f-4a8f-9039-635d1f8e09ad", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
} | |
}, | |
"source": [ | |
"from sklearn.metrics import precision_score, f1_score\n", | |
"print(precision_score(y_test, best_preds, average='macro'))\n", | |
"print(f1_score(y_test, best_preds, average='weighted'))\n" | |
], | |
"execution_count": 13, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"1.0\n", | |
"1.0\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "BoZY_M9qaYdf" | |
}, | |
"source": [ | |
"" | |
], | |
"execution_count": null, | |
"outputs": [] | |
} | |
] | |
} |
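
The prediction and evaluation steps in the notebook (taking the most probable class per row of the `multi:softprob` output, then reporting macro precision and weighted F1) can be traced by hand. The sketch below is a minimal pure-Python illustration of those two metrics, not the scikit-learn implementation. The helper names (`argmax_labels`, `per_class_scores`, `macro_precision`, `weighted_f1`) and the toy probabilities are hypothetical and chosen only for this example; a class with no predictions is assumed to score 0.

```python
from collections import Counter


def argmax_labels(prob_rows):
    """Return the index of the largest probability in each row (first index on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


def per_class_scores(y_true, y_pred, labels):
    """Compute (precision, recall, f1) for every class label."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: undefined -> 0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_precision(y_true, y_pred):
    """Unweighted mean of the per-class precisions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = per_class_scores(y_true, y_pred, labels)
    return sum(scores[c][0] for c in labels) / len(labels)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by each class's true count."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = per_class_scores(y_true, y_pred, labels)
    support = Counter(y_true)
    return sum(scores[c][2] * support[c] for c in labels) / len(y_true)


# Hypothetical 4-sample, 3-class example (not taken from the iris run above).
toy_probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.3, 0.6], [0.3, 0.5, 0.2]]
toy_true = [0, 1, 2, 2]
toy_pred = argmax_labels(toy_probs)  # [0, 1, 2, 1]

print(toy_pred)
print(macro_precision(toy_true, toy_pred))  # (1 + 0.5 + 1) / 3
print(weighted_f1(toy_true, toy_pred))      # (1*1 + (2/3)*1 + (2/3)*2) / 4 = 0.75
```

The two averages differ in how they treat class size: macro averaging gives every class equal weight, while weighted averaging scales each class's score by its number of true samples. On the iris test split in the notebook both come out as 1.0 because every prediction is correct.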