@VictoriaMaia
Last active June 3, 2019 03:20
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem taken from https://archive.ics.uci.edu/ml/datasets/Iris"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on a flower's measurements (sepal and petal length and width), we will classify which type of Iris it is."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will solve this classification problem using an SVM (support vector machine).\n",
"\n",
"For more information, this site may be useful: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.svm import SVC\n",
"from sklearn.model_selection import cross_val_score \n",
"from sklearn import metrics as mt\n",
"from sklearn.preprocessing import StandardScaler"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Adding column labels to make the dataset easier to read"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"colunas = ['s_length',\n",
" 's_width',\n",
" 'p_length',\n",
" 'p_width',\n",
" 'y']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv('iris.data', names=colunas)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>s_length</th>\n",
" <th>s_width</th>\n",
" <th>p_length</th>\n",
" <th>p_width</th>\n",
" <th>y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" s_length s_width p_length p_width y\n",
"0 5.1 3.5 1.4 0.2 Iris-setosa\n",
"1 4.9 3.0 1.4 0.2 Iris-setosa\n",
"2 4.7 3.2 1.3 0.2 Iris-setosa\n",
"3 4.6 3.1 1.5 0.2 Iris-setosa\n",
"4 5.0 3.6 1.4 0.2 Iris-setosa"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"s_length 0\n",
"s_width 0\n",
"p_length 0\n",
"p_width 0\n",
"dtype: int64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since there are no missing values and nothing in the dataset needs to be changed, we can separate the features from the target and split the data into training and test sets"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"y = data['y'].values\n",
"data = data.drop('y', axis=1) \n",
"X = data.values"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating the model and choosing the kernel. In an SVM, the kernel is a function that implicitly maps the data into a higher-dimensional space, where a hyperplane capable of separating the classes can be found."
]
},
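{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of what the `rbf` kernel computes, offered as an illustration rather than as part of the original analysis. It assumes the usual RBF definition `k(x, z) = exp(-gamma * ||x - z||^2)` with the same `gamma=0.2` passed to `SVC` in this notebook, applies it to the first three feature rows shown by `data.head()`, and compares the result with scikit-learn's `rbf_kernel`. The names `A`, `K_manual` and `K_sklearn` are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics.pairwise import rbf_kernel\n",
"\n",
"gamma = 0.2  # same value passed to SVC in this notebook\n",
"\n",
"# First three feature rows from data.head(), hard-coded so the cell is self-contained\n",
"A = np.array([[5.1, 3.5, 1.4, 0.2],\n",
"              [4.9, 3.0, 1.4, 0.2],\n",
"              [4.7, 3.2, 1.3, 0.2]])\n",
"\n",
"# Squared Euclidean distance between every pair of rows\n",
"sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)\n",
"\n",
"# RBF kernel: similarity decays exponentially with squared distance\n",
"K_manual = np.exp(-gamma * sq_dists)\n",
"\n",
"# scikit-learn's implementation, for comparison\n",
"K_sklearn = rbf_kernel(A, A, gamma=gamma)\n",
"\n",
"print(np.allclose(K_manual, K_sklearn))"
]
},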
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"svm = SVC(kernel='rbf', random_state=1, gamma=0.2, C=1)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma=0.2, kernel='rbf',\n",
" max_iter=-1, probability=False, random_state=1, shrinking=True,\n",
" tol=0.001, verbose=False)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"svm.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use cross-validation, a technique for estimating how well the model generalizes without touching the test set, which helps detect overfitting. It consists of splitting the training set into several folds and, in each round, training on all folds but one and validating on the fold that was left out, as shown in the following figure"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![K-fold cross-validation diagram](crossover.png)\n",
"\n",
"Image taken from: https://scikit-learn.org/stable/modules/cross_validation.html"
]
},
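{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fold-by-fold procedure can be sketched as an explicit loop. This is only an illustration under a few assumptions: it loads Iris through scikit-learn's `load_iris` instead of the local `iris.data` file so that the cell runs on its own, and the names `X_demo`, `y_demo`, `model` and `manual_scores` are illustrative. For a classifier and an integer `cv`, `cross_val_score` uses stratified folds, so the loop below uses `StratifiedKFold` and should reproduce the helper's scores."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.base import clone\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import StratifiedKFold, cross_val_score\n",
"from sklearn.svm import SVC\n",
"\n",
"# Assumption: scikit-learn's bundled copy of Iris stands in for iris.data\n",
"X_demo, y_demo = load_iris(return_X_y=True)\n",
"model = SVC(kernel='rbf', random_state=1, gamma=0.2, C=1)\n",
"\n",
"folds = StratifiedKFold(n_splits=10)\n",
"manual_scores = []\n",
"for train_idx, val_idx in folds.split(X_demo, y_demo):\n",
"    # Fit a fresh copy on nine folds, score it on the held-out fold\n",
"    fold_model = clone(model)\n",
"    fold_model.fit(X_demo[train_idx], y_demo[train_idx])\n",
"    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))\n",
"manual_scores = np.array(manual_scores)\n",
"\n",
"# The helper performs the same splits and fits internally\n",
"helper_scores = cross_val_score(model, X_demo, y_demo, cv=10)\n",
"print(np.allclose(manual_scores, helper_scores))"
]
},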
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"scores = cross_val_score(svm, X_train, y_train, cv=10)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1. , 1. , 1. , 1. , 0.91666667,\n",
" 1. , 1. , 1. , 0.91666667, 1. ])"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scores"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-validation accuracy: 0.98 (+/- 0.07)\n"
]
}
],
"source": [
"print(\"Cross-validation accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"y_pred = svm.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.97\n"
]
}
],
"source": [
"print(\"Accuracy: %0.2f\" % mt.accuracy_score(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Standardizing the data and checking whether the results improve"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"stdsc = StandardScaler()\n",
"X_train_std = stdsc.fit_transform(X_train)\n",
"X_test_std = stdsc.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"svm_std = SVC(kernel='rbf', random_state=1, gamma=0.2, C=1)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma=0.2, kernel='rbf',\n",
" max_iter=-1, probability=False, random_state=1, shrinking=True,\n",
" tol=0.001, verbose=False)"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"svm_std.fit(X_train_std, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"scores_std = cross_val_score(svm_std, X_train_std, y_train, cv=10)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.83333333, 1. , 1. , 0.91666667, 0.91666667,\n",
" 1. , 0.91666667, 1. , 1. , 1. ])"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scores_std"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-validation accuracy: 0.96 (+/- 0.11)\n"
]
}
],
"source": [
"print(\"Cross-validation accuracy: %0.2f (+/- %0.2f)\" % (scores_std.mean(), scores_std.std() * 2))"
]
},
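{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible refinement, shown as a hedged sketch rather than as the method used above: when the scaler is fitted on the whole training set before cross-validation, each validation fold has already influenced the scaling statistics. Wrapping `StandardScaler` and `SVC` in a pipeline makes `cross_val_score` refit the scaler inside every fold. The cell loads Iris through `load_iris` so that it runs on its own, and the names `X_demo`, `y_demo`, `pipe` and `pipe_scores` are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.svm import SVC\n",
"\n",
"# Assumption: scikit-learn's bundled copy of Iris stands in for iris.data\n",
"X_demo, y_demo = load_iris(return_X_y=True)\n",
"\n",
"# The scaler is fitted only on the training part of each fold\n",
"pipe = make_pipeline(StandardScaler(),\n",
"                     SVC(kernel='rbf', random_state=1, gamma=0.2, C=1))\n",
"\n",
"pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=10)\n",
"print('Cross-validation accuracy: %0.2f (+/- %0.2f)' % (pipe_scores.mean(), pipe_scores.std() * 2))"
]
},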
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"y_predStd = svm_std.predict(X_test_std)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.97\n"
]
}
],
"source": [
"print(\"Accuracy: %0.2f\" % mt.accuracy_score(y_test, y_predStd))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}