Last active
June 3, 2019 03:20
{ | |
"cells": [ | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem taken from https://archive.ics.uci.edu/ml/datasets/Iris"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on a flower's measurements (sepal and petal length and width), we will classify which Iris species it belongs to."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will solve this classification problem with an SVM (support vector machine).\n",
"\n",
"For more information, this site may be useful: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn import metrics as mt\n",
"from sklearn.model_selection import cross_val_score, train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.svm import SVC"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Naming the columns to make the dataset easier to read"
]
},
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"colunas = ['s_length',\n", | |
" 's_width',\n", | |
" 'p_length',\n", | |
" 'p_width',\n", | |
" 'y']" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"data = pd.read_csv('iris.data', names=colunas)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>s_length</th>\n", | |
" <th>s_width</th>\n", | |
" <th>p_length</th>\n", | |
" <th>p_width</th>\n", | |
" <th>y</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>5.1</td>\n", | |
" <td>3.5</td>\n", | |
" <td>1.4</td>\n", | |
" <td>0.2</td>\n", | |
" <td>Iris-setosa</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>4.9</td>\n", | |
" <td>3.0</td>\n", | |
" <td>1.4</td>\n", | |
" <td>0.2</td>\n", | |
" <td>Iris-setosa</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>4.7</td>\n", | |
" <td>3.2</td>\n", | |
" <td>1.3</td>\n", | |
" <td>0.2</td>\n", | |
" <td>Iris-setosa</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>4.6</td>\n", | |
" <td>3.1</td>\n", | |
" <td>1.5</td>\n", | |
" <td>0.2</td>\n", | |
" <td>Iris-setosa</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>5.0</td>\n", | |
" <td>3.6</td>\n", | |
" <td>1.4</td>\n", | |
" <td>0.2</td>\n", | |
" <td>Iris-setosa</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" s_length s_width p_length p_width y\n", | |
"0 5.1 3.5 1.4 0.2 Iris-setosa\n", | |
"1 4.9 3.0 1.4 0.2 Iris-setosa\n", | |
"2 4.7 3.2 1.3 0.2 Iris-setosa\n", | |
"3 4.6 3.1 1.5 0.2 Iris-setosa\n", | |
"4 5.0 3.6 1.4 0.2 Iris-setosa" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"s_length 0\n", | |
"s_width 0\n", | |
"p_length 0\n", | |
"p_width 0\n", | |
"dtype: int64" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data.isnull().sum()" | |
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since there are no missing values and nothing in the dataset needs to change, we can split the data into training and test sets."
]
},
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"y = data['y'].values\n", | |
"data = data.drop('y', axis=1) \n", | |
"X = data.values" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)" | |
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating the model and choosing its kernel. A kernel is a function the SVM uses to implicitly map the data into a space where a hyperplane can separate the classes."
]
},
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"svm = SVC(kernel='rbf', random_state=1, gamma=0.2, C=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,\n", | |
" decision_function_shape='ovr', degree=3, gamma=0.2, kernel='rbf',\n", | |
" max_iter=-1, probability=False, random_state=1, shrinking=True,\n", | |
" tol=0.001, verbose=False)" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"svm.fit(X_train, y_train)" | |
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use cross-validation, a technique for checking how well the model performs without touching the test set, which helps reveal overfitting. It splits the training set into several folds; each round holds out a different fold for evaluation and trains on the rest, as shown in the next figure."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Cross-validation diagram](crossover.png)\n",
"\n",
"Image taken from: https://scikit-learn.org/stable/modules/cross_validation.html"
]
},
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"scores = cross_val_score(svm, X_train, y_train, cv=10)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([1. , 1. , 1. , 1. , 0.91666667,\n", | |
" 1. , 1. , 1. , 0.91666667, 1. ])" | |
] | |
}, | |
"execution_count": 26, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"scores" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-validation accuracy: 0.98 (+/- 0.07)\n"
]
}
],
"source": [
"print(\"Cross-validation accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
]
},
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"y_pred = svm.predict(X_test)" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test accuracy: 0.97\n"
]
}
],
"source": [
"print(\"Test accuracy: %0.2f\" % mt.accuracy_score(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Standardizing the data and checking whether the results improve"
]
},
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"stdsc = StandardScaler()\n", | |
"X_train_std = stdsc.fit_transform(X_train)\n", | |
"X_test_std = stdsc.transform(X_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"svm_std = SVC(kernel='rbf', random_state=1, gamma=0.2, C=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,\n", | |
" decision_function_shape='ovr', degree=3, gamma=0.2, kernel='rbf',\n", | |
" max_iter=-1, probability=False, random_state=1, shrinking=True,\n", | |
" tol=0.001, verbose=False)" | |
] | |
}, | |
"execution_count": 36, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"svm_std.fit(X_train_std, y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"scores_std = cross_val_score(svm_std, X_train_std, y_train, cv=10)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([0.83333333, 1. , 1. , 0.91666667, 0.91666667,\n", | |
" 1. , 0.91666667, 1. , 1. , 1. ])" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"scores_std" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-validation accuracy: 0.96 (+/- 0.11)\n"
]
}
],
"source": [
"print(\"Cross-validation accuracy: %0.2f (+/- %0.2f)\" % (scores_std.mean(), scores_std.std() * 2))"
]
},
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"y_predStd = svm_std.predict(X_test_std)" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test accuracy: 0.97\n"
]
}
],
"source": [
"print(\"Test accuracy: %0.2f\" % mt.accuracy_score(y_test, y_predStd))"
]
}
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.5" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
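
The standardized run in the notebook fits `StandardScaler` on the whole training set and only then calls `cross_val_score`, so each validation fold has already influenced the scaling statistics. A minimal sketch of one way to avoid that, assuming scikit-learn's `make_pipeline` is acceptable here and that the copy of Iris bundled with scikit-learn (`load_iris`) can stand in for the local `iris.data` file. Neither appears in the notebook itself; the hyperparameters are copied from it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: use the Iris copy bundled with scikit-learn instead of 'iris.data'
X, y = load_iris(return_X_y=True)

# Same split settings as the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Chaining scaler and classifier means the scaler is refit on the
# training portion of every fold, never on the held-out portion
pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma=0.2, C=1))

scores = cross_val_score(pipeline, X_train, y_train, cv=10)
print("Cross-validation accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Final fit on the full training set, then a single evaluation on the test set
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy: %0.2f" % test_accuracy)
```

The numbers this prints may differ slightly from the ones recorded in the notebook, since the scaling is now computed per fold.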