@thiagolcks
Created May 8, 2018 19:07
Question 1 - Classification
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Human Activity Recognition Using Smartphones"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:54.003798Z",
"start_time": "2018-05-08T18:07:50.012603Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"# load the libraries\n",
"import itertools\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import pandas as pd\n",
"from sklearn import svm\n",
"from sklearn.naive_bayes import GaussianNB, BernoulliNB\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.neural_network import MLPClassifier\n",
"from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score\n",
"from sklearn.feature_selection import VarianceThreshold\n",
"\n",
"from xgboost import XGBClassifier\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1 - Preparing the Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-17T19:49:26.061482Z",
"start_time": "2018-04-17T19:49:26.055307Z"
}
},
"source": [
"### 1.1 - Loading and Inspecting the Data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:56.507160Z",
"start_time": "2018-05-08T18:07:54.006400Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"X_train = pd.read_table(\"../data/UCI-HAR/train/X_train.txt\", sep='\\s+', header=None)\n",
"y_train = pd.read_table(\"../data/UCI-HAR/train/y_train.txt\", header=None, dtype='category')\n",
"\n",
"X_test = pd.read_table(\"../data/UCI-HAR/test/X_test.txt\", sep='\\s+', header=None)\n",
"y_test = pd.read_table(\"../data/UCI-HAR/test/y_test.txt\", header=None, dtype='category')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:56.522442Z",
"start_time": "2018-05-08T18:07:56.509540Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"labels = sorted(y_train[0].unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After loading the data, let's check its structure:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:56.536372Z",
"start_time": "2018-05-08T18:07:56.526042Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(7352, 561)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:56.597327Z",
"start_time": "2018-05-08T18:07:56.539365Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>551</th>\n",
" <th>552</th>\n",
" <th>553</th>\n",
" <th>554</th>\n",
" <th>555</th>\n",
" <th>556</th>\n",
" <th>557</th>\n",
" <th>558</th>\n",
" <th>559</th>\n",
" <th>560</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.288585</td>\n",
" <td>-0.020294</td>\n",
" <td>-0.132905</td>\n",
" <td>-0.995279</td>\n",
" <td>-0.983111</td>\n",
" <td>-0.913526</td>\n",
" <td>-0.995112</td>\n",
" <td>-0.983185</td>\n",
" <td>-0.923527</td>\n",
" <td>-0.934724</td>\n",
" <td>...</td>\n",
" <td>-0.074323</td>\n",
" <td>-0.298676</td>\n",
" <td>-0.710304</td>\n",
" <td>-0.112754</td>\n",
" <td>0.030400</td>\n",
" <td>-0.464761</td>\n",
" <td>-0.018446</td>\n",
" <td>-0.841247</td>\n",
" <td>0.179941</td>\n",
" <td>-0.058627</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.278419</td>\n",
" <td>-0.016411</td>\n",
" <td>-0.123520</td>\n",
" <td>-0.998245</td>\n",
" <td>-0.975300</td>\n",
" <td>-0.960322</td>\n",
" <td>-0.998807</td>\n",
" <td>-0.974914</td>\n",
" <td>-0.957686</td>\n",
" <td>-0.943068</td>\n",
" <td>...</td>\n",
" <td>0.158075</td>\n",
" <td>-0.595051</td>\n",
" <td>-0.861499</td>\n",
" <td>0.053477</td>\n",
" <td>-0.007435</td>\n",
" <td>-0.732626</td>\n",
" <td>0.703511</td>\n",
" <td>-0.844788</td>\n",
" <td>0.180289</td>\n",
" <td>-0.054317</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.279653</td>\n",
" <td>-0.019467</td>\n",
" <td>-0.113462</td>\n",
" <td>-0.995380</td>\n",
" <td>-0.967187</td>\n",
" <td>-0.978944</td>\n",
" <td>-0.996520</td>\n",
" <td>-0.963668</td>\n",
" <td>-0.977469</td>\n",
" <td>-0.938692</td>\n",
" <td>...</td>\n",
" <td>0.414503</td>\n",
" <td>-0.390748</td>\n",
" <td>-0.760104</td>\n",
" <td>-0.118559</td>\n",
" <td>0.177899</td>\n",
" <td>0.100699</td>\n",
" <td>0.808529</td>\n",
" <td>-0.848933</td>\n",
" <td>0.180637</td>\n",
" <td>-0.049118</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.279174</td>\n",
" <td>-0.026201</td>\n",
" <td>-0.123283</td>\n",
" <td>-0.996091</td>\n",
" <td>-0.983403</td>\n",
" <td>-0.990675</td>\n",
" <td>-0.997099</td>\n",
" <td>-0.982750</td>\n",
" <td>-0.989302</td>\n",
" <td>-0.938692</td>\n",
" <td>...</td>\n",
" <td>0.404573</td>\n",
" <td>-0.117290</td>\n",
" <td>-0.482845</td>\n",
" <td>-0.036788</td>\n",
" <td>-0.012892</td>\n",
" <td>0.640011</td>\n",
" <td>-0.485366</td>\n",
" <td>-0.848649</td>\n",
" <td>0.181935</td>\n",
" <td>-0.047663</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.276629</td>\n",
" <td>-0.016570</td>\n",
" <td>-0.115362</td>\n",
" <td>-0.998139</td>\n",
" <td>-0.980817</td>\n",
" <td>-0.990482</td>\n",
" <td>-0.998321</td>\n",
" <td>-0.979672</td>\n",
" <td>-0.990441</td>\n",
" <td>-0.942469</td>\n",
" <td>...</td>\n",
" <td>0.087753</td>\n",
" <td>-0.351471</td>\n",
" <td>-0.699205</td>\n",
" <td>0.123320</td>\n",
" <td>0.122542</td>\n",
" <td>0.693578</td>\n",
" <td>-0.615971</td>\n",
" <td>-0.847865</td>\n",
" <td>0.185151</td>\n",
" <td>-0.043892</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 561 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 \\\n",
"0 0.288585 -0.020294 -0.132905 -0.995279 -0.983111 -0.913526 -0.995112 \n",
"1 0.278419 -0.016411 -0.123520 -0.998245 -0.975300 -0.960322 -0.998807 \n",
"2 0.279653 -0.019467 -0.113462 -0.995380 -0.967187 -0.978944 -0.996520 \n",
"3 0.279174 -0.026201 -0.123283 -0.996091 -0.983403 -0.990675 -0.997099 \n",
"4 0.276629 -0.016570 -0.115362 -0.998139 -0.980817 -0.990482 -0.998321 \n",
"\n",
" 7 8 9 ... 551 552 553 \\\n",
"0 -0.983185 -0.923527 -0.934724 ... -0.074323 -0.298676 -0.710304 \n",
"1 -0.974914 -0.957686 -0.943068 ... 0.158075 -0.595051 -0.861499 \n",
"2 -0.963668 -0.977469 -0.938692 ... 0.414503 -0.390748 -0.760104 \n",
"3 -0.982750 -0.989302 -0.938692 ... 0.404573 -0.117290 -0.482845 \n",
"4 -0.979672 -0.990441 -0.942469 ... 0.087753 -0.351471 -0.699205 \n",
"\n",
" 554 555 556 557 558 559 560 \n",
"0 -0.112754 0.030400 -0.464761 -0.018446 -0.841247 0.179941 -0.058627 \n",
"1 0.053477 -0.007435 -0.732626 0.703511 -0.844788 0.180289 -0.054317 \n",
"2 -0.118559 0.177899 0.100699 0.808529 -0.848933 0.180637 -0.049118 \n",
"3 -0.036788 -0.012892 0.640011 -0.485366 -0.848649 0.181935 -0.047663 \n",
"4 0.123320 0.122542 0.693578 -0.615971 -0.847865 0.185151 -0.043892 \n",
"\n",
"[5 rows x 561 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:58.141736Z",
"start_time": "2018-05-08T18:07:56.602920Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>551</th>\n",
" <th>552</th>\n",
" <th>553</th>\n",
" <th>554</th>\n",
" <th>555</th>\n",
" <th>556</th>\n",
" <th>557</th>\n",
" <th>558</th>\n",
" <th>559</th>\n",
" <th>560</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>...</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" <td>7352.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>0.274488</td>\n",
" <td>-0.017695</td>\n",
" <td>-0.109141</td>\n",
" <td>-0.605438</td>\n",
" <td>-0.510938</td>\n",
" <td>-0.604754</td>\n",
" <td>-0.630512</td>\n",
" <td>-0.526907</td>\n",
" <td>-0.606150</td>\n",
" <td>-0.468604</td>\n",
" <td>...</td>\n",
" <td>0.125293</td>\n",
" <td>-0.307009</td>\n",
" <td>-0.625294</td>\n",
" <td>0.008684</td>\n",
" <td>0.002186</td>\n",
" <td>0.008726</td>\n",
" <td>-0.005981</td>\n",
" <td>-0.489547</td>\n",
" <td>0.058593</td>\n",
" <td>-0.056515</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.070261</td>\n",
" <td>0.040811</td>\n",
" <td>0.056635</td>\n",
" <td>0.448734</td>\n",
" <td>0.502645</td>\n",
" <td>0.418687</td>\n",
" <td>0.424073</td>\n",
" <td>0.485942</td>\n",
" <td>0.414122</td>\n",
" <td>0.544547</td>\n",
" <td>...</td>\n",
" <td>0.250994</td>\n",
" <td>0.321011</td>\n",
" <td>0.307584</td>\n",
" <td>0.336787</td>\n",
" <td>0.448306</td>\n",
" <td>0.608303</td>\n",
" <td>0.477975</td>\n",
" <td>0.511807</td>\n",
" <td>0.297480</td>\n",
" <td>0.279122</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-0.999873</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000</td>\n",
" <td>-0.995357</td>\n",
" <td>-0.999765</td>\n",
" <td>-0.976580</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.262975</td>\n",
" <td>-0.024863</td>\n",
" <td>-0.120993</td>\n",
" <td>-0.992754</td>\n",
" <td>-0.978129</td>\n",
" <td>-0.980233</td>\n",
" <td>-0.993591</td>\n",
" <td>-0.978162</td>\n",
" <td>-0.980251</td>\n",
" <td>-0.936219</td>\n",
" <td>...</td>\n",
" <td>-0.023692</td>\n",
" <td>-0.542602</td>\n",
" <td>-0.845573</td>\n",
" <td>-0.121527</td>\n",
" <td>-0.289549</td>\n",
" <td>-0.482273</td>\n",
" <td>-0.376341</td>\n",
" <td>-0.812065</td>\n",
" <td>-0.017885</td>\n",
" <td>-0.143414</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>0.277193</td>\n",
" <td>-0.017219</td>\n",
" <td>-0.108676</td>\n",
" <td>-0.946196</td>\n",
" <td>-0.851897</td>\n",
" <td>-0.859365</td>\n",
" <td>-0.950709</td>\n",
" <td>-0.857328</td>\n",
" <td>-0.857143</td>\n",
" <td>-0.881637</td>\n",
" <td>...</td>\n",
" <td>0.134000</td>\n",
" <td>-0.343685</td>\n",
" <td>-0.711692</td>\n",
" <td>0.009509</td>\n",
" <td>0.008943</td>\n",
" <td>0.008735</td>\n",
" <td>-0.000368</td>\n",
" <td>-0.709417</td>\n",
" <td>0.182071</td>\n",
" <td>0.003181</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>0.288461</td>\n",
" <td>-0.010783</td>\n",
" <td>-0.097794</td>\n",
" <td>-0.242813</td>\n",
" <td>-0.034231</td>\n",
" <td>-0.262415</td>\n",
" <td>-0.292680</td>\n",
" <td>-0.066701</td>\n",
" <td>-0.265671</td>\n",
" <td>-0.017129</td>\n",
" <td>...</td>\n",
" <td>0.289096</td>\n",
" <td>-0.126979</td>\n",
" <td>-0.503878</td>\n",
" <td>0.150865</td>\n",
" <td>0.292861</td>\n",
" <td>0.506187</td>\n",
" <td>0.359368</td>\n",
" <td>-0.509079</td>\n",
" <td>0.248353</td>\n",
" <td>0.107659</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.916238</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.967664</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>...</td>\n",
" <td>0.946700</td>\n",
" <td>0.989538</td>\n",
" <td>0.956845</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.998702</td>\n",
" <td>0.996078</td>\n",
" <td>1.000000</td>\n",
" <td>0.478157</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 561 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 \\\n",
"count 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 \n",
"mean 0.274488 -0.017695 -0.109141 -0.605438 -0.510938 \n",
"std 0.070261 0.040811 0.056635 0.448734 0.502645 \n",
"min -1.000000 -1.000000 -1.000000 -1.000000 -0.999873 \n",
"25% 0.262975 -0.024863 -0.120993 -0.992754 -0.978129 \n",
"50% 0.277193 -0.017219 -0.108676 -0.946196 -0.851897 \n",
"75% 0.288461 -0.010783 -0.097794 -0.242813 -0.034231 \n",
"max 1.000000 1.000000 1.000000 1.000000 0.916238 \n",
"\n",
" 5 6 7 8 9 \\\n",
"count 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 \n",
"mean -0.604754 -0.630512 -0.526907 -0.606150 -0.468604 \n",
"std 0.418687 0.424073 0.485942 0.414122 0.544547 \n",
"min -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 \n",
"25% -0.980233 -0.993591 -0.978162 -0.980251 -0.936219 \n",
"50% -0.859365 -0.950709 -0.857328 -0.857143 -0.881637 \n",
"75% -0.262415 -0.292680 -0.066701 -0.265671 -0.017129 \n",
"max 1.000000 1.000000 0.967664 1.000000 1.000000 \n",
"\n",
" ... 551 552 553 554 \\\n",
"count ... 7352.000000 7352.000000 7352.000000 7352.000000 \n",
"mean ... 0.125293 -0.307009 -0.625294 0.008684 \n",
"std ... 0.250994 0.321011 0.307584 0.336787 \n",
"min ... -1.000000 -0.995357 -0.999765 -0.976580 \n",
"25% ... -0.023692 -0.542602 -0.845573 -0.121527 \n",
"50% ... 0.134000 -0.343685 -0.711692 0.009509 \n",
"75% ... 0.289096 -0.126979 -0.503878 0.150865 \n",
"max ... 0.946700 0.989538 0.956845 1.000000 \n",
"\n",
" 555 556 557 558 559 \\\n",
"count 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 \n",
"mean 0.002186 0.008726 -0.005981 -0.489547 0.058593 \n",
"std 0.448306 0.608303 0.477975 0.511807 0.297480 \n",
"min -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 \n",
"25% -0.289549 -0.482273 -0.376341 -0.812065 -0.017885 \n",
"50% 0.008943 0.008735 -0.000368 -0.709417 0.182071 \n",
"75% 0.292861 0.506187 0.359368 -0.509079 0.248353 \n",
"max 1.000000 0.998702 0.996078 1.000000 0.478157 \n",
"\n",
" 560 \n",
"count 7352.000000 \n",
"mean -0.056515 \n",
"std 0.279122 \n",
"min -1.000000 \n",
"25% -0.143414 \n",
"50% 0.003181 \n",
"75% 0.107659 \n",
"max 1.000000 \n",
"\n",
"[8 rows x 561 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.describe()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:58.149095Z",
"start_time": "2018-05-08T18:07:58.144029Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(7352, 1)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train.shape"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:58.172988Z",
"start_time": "2018-05-08T18:07:58.153024Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"0 5\n",
"1 5\n",
"2 5\n",
"3 5\n",
"4 5"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:58.200804Z",
"start_time": "2018-05-08T18:07:58.175887Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>7352</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>1407</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"count 7352\n",
"unique 6\n",
"top 6\n",
"freq 1407"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check which classes are available:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:58.213041Z",
"start_time": "2018-05-08T18:07:58.204487Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['1', '2', '3', '4', '5', '6']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labels"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-17T20:08:35.976687Z",
"start_time": "2018-04-17T20:08:35.969185Z"
}
},
"source": [
"This matches the documentation, which gives the following label for each class:\n",
" \n",
"- 1: WALKING\n",
"- 2: WALKING_UPSTAIRS\n",
"- 3: WALKING_DOWNSTAIRS\n",
"- 4: SITTING\n",
"- 5: STANDING\n",
"- 6: LAYING"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-17T20:10:49.923595Z",
"start_time": "2018-04-17T20:10:49.917327Z"
}
},
"source": [
"### 1.2 - Data Cleaning and Feature Selection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's check whether the dataframe contains any null values:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:58.235889Z",
"start_time": "2018-05-08T18:07:58.216625Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.isnull().values.any()"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-17T20:15:44.964605Z",
"start_time": "2018-04-17T20:15:44.953723Z"
}
},
"source": [
"Since there are no null values, and since I cannot examine each of the 561 features individually, I see no need for any cleaning or for manual feature selection. Let's try VarianceThreshold instead."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:58.730777Z",
"start_time": "2018-05-08T18:07:58.241144Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(7352, 99)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sel = VarianceThreshold(.8 * (1 - .8))\n",
"fs_features = sel.fit(X_train).get_support(indices=True)\n",
"X_train_fs = X_train[fs_features]\n",
"X_test_fs = X_test[fs_features]\n",
"X_train_fs.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That is a large reduction (from 561 to 99 features); let's see whether it improves the algorithms."
]
},
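{
"cell_type": "markdown",
"metadata": {},
"source": [
"The value `.8 * (1 - .8)` is 0.16, the variance of a Bernoulli variable with p = 0.8. scikit-learn's documentation suggests it for boolean features, so using it on continuous features in [-1, 1] is a heuristic rather than a principled cut-off. As a rough sketch, the cell below checks on made-up data that `VarianceThreshold` keeps exactly the columns whose population variance exceeds the threshold. The helper `count_high_variance_columns` and the `demo` frame are hypothetical names introduced for illustration; under that assumption, `count_high_variance_columns(X_train, 0.16)` should match the number of columns in `X_train_fs`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.feature_selection import VarianceThreshold\n",
"\n",
"\n",
"def count_high_variance_columns(df, threshold):\n",
"    \"\"\"Count columns whose population variance (ddof=0) exceeds threshold.\"\"\"\n",
"    return int((df.var(ddof=0) > threshold).sum())\n",
"\n",
"\n",
"# Tiny synthetic frame (illustrative values, not the UCI-HAR features)\n",
"demo = pd.DataFrame({\n",
"    'a': [-1.0, 1.0, -1.0, 1.0],  # variance 1.0\n",
"    'b': [0.1, 0.1, 0.2, 0.2],    # variance 0.0025\n",
"    'c': [0.0, 0.0, 0.0, 1.0],    # variance 0.1875\n",
"})\n",
"\n",
"n_manual = count_high_variance_columns(demo, 0.16)\n",
"n_sklearn = int(VarianceThreshold(0.16).fit(demo).get_support().sum())\n",
"print(n_manual, n_sklearn)"
]
},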
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3 - Normalization\n",
"\n",
"As the summary statistics above show, all values are on the same scale, between -1 and 1, so I see no need for any rescaling."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:58.828158Z",
"start_time": "2018-05-08T18:07:58.737806Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check whether any value is less than -1\n",
"(X_train.min() < -1).any()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:58.906005Z",
"start_time": "2018-05-08T18:07:58.830971Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check whether any value is greater than 1\n",
"(X_train.max() > 1).any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.4 - Class Balance\n",
"\n",
"Let's start by checking how many entries we have for each class:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:59.284600Z",
"start_time": "2018-05-08T18:07:58.908989Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x10636f1d0>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"y_train[0].value_counts().plot(kind='bar')"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:59.302271Z",
"start_time": "2018-05-08T18:07:59.288477Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"421"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train[0].value_counts()['6'] - y_train[0].value_counts()['3'] "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is a difference of 421 entries between the classes with the most and the fewest entries (1407 versus 986). I don't think that justifies using any balancing technique."
]
},
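{
"cell_type": "markdown",
"metadata": {},
"source": [
"The absolute gap is easier to judge relative to the class sizes. A minimal sketch, assuming a hypothetical helper `imbalance_ratio` (not part of the original analysis): it divides the largest class count by the smallest, so 1.0 means perfectly balanced. The `demo` labels are made up; on this dataset the call would be `imbalance_ratio(y_train[0])`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"def imbalance_ratio(class_labels):\n",
"    \"\"\"Ratio between the largest and the smallest class counts (1.0 = balanced).\"\"\"\n",
"    counts = pd.Series(class_labels).value_counts()\n",
"    return counts.max() / counts.min()\n",
"\n",
"\n",
"# Made-up labels: three of class '1', two of class '2', one of class '3'\n",
"demo = pd.Series(['1', '1', '1', '2', '2', '3'])\n",
"ratio = imbalance_ratio(demo)\n",
"print(ratio)"
]
},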
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2 - Training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.0 - Helper Functions"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:59.316223Z",
"start_time": "2018-05-08T18:07:59.304751Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"def plot_confusion_matrix(cm, classes,\n",
"                          normalize=False,\n",
"                          title='Confusion matrix',\n",
"                          cmap=plt.cm.Blues):\n",
"    \"\"\"\n",
"    This function prints and plots the confusion matrix.\n",
"    Normalization can be applied by setting `normalize=True`.\n",
"    \"\"\"\n",
"    if normalize:\n",
"        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
"        print(\"Normalized confusion matrix\")\n",
"    else:\n",
"        print('Confusion matrix, without normalization')\n",
"\n",
"    plt.figure()\n",
"    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
"    plt.title(title)\n",
"    plt.colorbar()\n",
"    tick_marks = np.arange(len(classes))\n",
"    plt.xticks(tick_marks, classes, rotation=45)\n",
"    plt.yticks(tick_marks, classes)\n",
"\n",
"    fmt = '.2f' if normalize else 'd'\n",
"    thresh = cm.max() / 2.\n",
"    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
"        plt.text(j, i, format(cm[i, j], fmt),\n",
"                 horizontalalignment=\"center\",\n",
"                 color=\"white\" if cm[i, j] > thresh else \"black\")\n",
"\n",
"    plt.tight_layout()\n",
"    plt.ylabel('True label')\n",
"    plt.xlabel('Predicted label')\n",
"    plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:59.326662Z",
"start_time": "2018-05-08T18:07:59.319428Z"
},
"collapsed": true
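},
"outputs": [],
"source": [
"def report(y_real, y_pred):\n",
"    \"\"\"Print error count, accuracy, per-class precision and recall, then plot the confusion matrix.\"\"\"\n",
"    # Printed labels are in Portuguese: Erros = errors, entradas = entries,\n",
"    # Acurácia = accuracy, Precisão = precision, Sensibilidade = recall.\n",
"    print(\"{} Erros de {} entradas\".format((y_real != y_pred).sum(), y_real.shape))\n",
"    print(\"Acurácia: {}\".format(accuracy_score(y_real, y_pred)))\n",
"    print(\"Precisão:\")\n",
"    print(precision_score(y_real, y_pred, average=None))\n",
"    print(\"Sensibilidade:\")\n",
"    print(recall_score(y_real, y_pred, average=None))\n",
"    plot_confusion_matrix(confusion_matrix(y_real, y_pred), labels)"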
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:59.342576Z",
"start_time": "2018-05-08T18:07:59.334374Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"def full_report(y_train, y_train_pred, y_test, y_test_pred):\n",
" print(\"TRAIN DATASET\")\n",
" report(y_train, y_train_pred)\n",
" print(\"\")\n",
" print(\"TEST DATASET\")\n",
" report(y_test, y_test_pred)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:59.353348Z",
"start_time": "2018-05-08T18:07:59.347657Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"def resumed_report(y_train, y_train_pred, y_test, y_test_pred):\n",
" print(\"Acurácia: {}\".format(accuracy_score(y_train, y_train_pred)))\n",
" print(\"Acurácia: {}\".format(accuracy_score(y_test, y_test_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-17T20:42:57.882639Z",
"start_time": "2018-04-17T20:42:57.876992Z"
}
},
"source": [
"### 2.1 - Naive Bayes"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:07:59.687908Z",
"start_time": "2018-05-08T18:07:59.358384Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"nb = BernoulliNB()\n",
"y_train_pred = nb.fit(X_train, y_train[0]).predict(X_train)\n",
"y_test_pred = nb.predict(X_test) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<B>Resultado</b>"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:00.407079Z",
"start_time": "2018-05-08T18:07:59.690439Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TRAIN DATASET\n",
"1025 Erros de (7352,) entradas\n",
"Acurácia: 0.860582154515778\n",
"Precisão:\n",
"[0.91168353 0.76415826 0.84090909 0.84210526 0.80230196 1. ]\n",
"Sensibilidade:\n",
"[0.80831974 0.91798695 0.78803245 0.77138414 0.86244541 0.99289268]\n",
"Confusion matrix, without normalization\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"TEST DATASET\n",
"442 Erros de (2947,) entradas\n",
"Acurácia: 0.8500169664065151\n",
"Precisão:\n",
"[0.75462185 0.83168317 0.85273973 0.88305489 0.80133556 1. ]\n",
"Sensibilidade:\n",
"[0.90524194 0.89171975 0.59285714 0.75356415 0.90225564 1. ]\n",
"Confusion matrix, without normalization\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"full_report(y_train[0], y_train_pred, y_test[0], y_test_pred)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:00.467676Z",
"start_time": "2018-05-08T18:08:00.409658Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"nb_score = nb.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-17T21:21:20.342197Z",
"start_time": "2018-04-17T21:21:20.327273Z"
}
},
"source": [
"Notas: A precisão da classe 4 e a sensibilidade das classes 3 e 6 estão baixas. E a classe 6, mesmo sendo a que contém maior quantidade de entradas está com uma sensibilidade bem baixa, o algoritmo está classificando 40% delas como classe 4."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Agora testando com a seleção de features:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:00.627554Z",
"start_time": "2018-05-08T18:08:00.470385Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.7702748557855447"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nb = BernoulliNB()\n",
"y_train_pred = nb.fit(X_train_fs, y_train[0]).predict(X_train_fs)\n",
"y_test_pred = nb.predict(X_test_fs)\n",
"nb.score(X_test_fs, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, pelo visto os features removidos tinham a sua importância. Vamos ver se esse padrão se repete com os outros algoritmos."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 - Ensemble - Random Forest"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Para o RandomForest eu vou explorar o número de estimadores:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:01.674838Z",
"start_time": "2018-05-08T18:08:00.631843Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.9990478781284005, 0.9107567017305734)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# n_estimators=10 is the default\n",
"rf = RandomForestClassifier(random_state=0, n_jobs=-1)\n",
"rf.fit(X_train, y_train[0])\n",
"rf.score(X_train, y_train), rf.score(X_test, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:03.401104Z",
"start_time": "2018-05-08T18:08:01.678296Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.9998639825897715, 0.9148286392941974)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rf = RandomForestClassifier(random_state=0, n_jobs=-1, n_estimators=20)\n",
"rf.fit(X_train, y_train[0])\n",
"rf.score(X_train, y_train), rf.score(X_test, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:06.157732Z",
"start_time": "2018-05-08T18:08:03.405717Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(1.0, 0.9185612487275195)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rf = RandomForestClassifier(random_state=0, n_jobs=-1, n_estimators=40)\n",
"rf.fit(X_train, y_train[0])\n",
"rf.score(X_train, y_train), rf.score(X_test, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:11.048872Z",
"start_time": "2018-05-08T18:08:06.162221Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(1.0, 0.9216152019002375)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rf = RandomForestClassifier(random_state=0, n_jobs=-1, n_estimators=80)\n",
"rf.fit(X_train, y_train[0])\n",
"rf.score(X_train, y_train), rf.score(X_test, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:23.272972Z",
"start_time": "2018-05-08T18:08:11.051497Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(1.0, 0.9263657957244655)"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rf = RandomForestClassifier(random_state=0, n_jobs=-1, n_estimators=160)\n",
"rf.fit(X_train, y_train[0])\n",
"rf.score(X_train, y_train), rf.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mesmo dobrando a quantidade de estimadores, 0.92 continua sendo o melhor resultado para o dataset de validação."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:24.426764Z",
"start_time": "2018-05-08T18:08:23.276243Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TRAIN DATASET\n",
"0 Erros de (7352,) entradas\n",
"Acurácia: 1.0\n",
"Precisão:\n",
"[1. 1. 1. 1. 1. 1.]\n",
"Sensibilidade:\n",
"[1. 1. 1. 1. 1. 1.]\n",
"Confusion matrix, without normalization\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"TEST DATASET\n",
"217 Erros de (2947,) entradas\n",
"Acurácia: 0.9263657957244655\n",
"Precisão:\n",
"[0.88970588 0.90063425 0.95675676 0.90927835 0.9070632 1. ]\n",
"Sensibilidade:\n",
"[0.97580645 0.9044586 0.84285714 0.89816701 0.91729323 1. ]\n",
"Confusion matrix, without normalization\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"y_train_pred = rf.predict(X_train)\n",
"y_test_pred = rf.predict(X_test)\n",
"full_report(y_train[0], y_train_pred, y_test[0], y_test_pred)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:24.672570Z",
"start_time": "2018-05-08T18:08:24.429779Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"rf_score = rf.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Vamos ver como fica utilizando o dataset com feature selection:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:08:28.494627Z",
"start_time": "2018-05-08T18:08:24.675895Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(1.0, 0.9015948422124194)"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rf = RandomForestClassifier(random_state=0, n_jobs=-1, n_estimators=160)\n",
"rf.fit(X_train_fs, y_train[0])\n",
"rf.score(X_train_fs, y_train), rf.score(X_test_fs, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Piorou 2,5%, não tão ruim quanto o NB."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 - Ensemble - Gradient Boosting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Vou começar definindo a baseline com as configurações padrões do xGBoost:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:11:24.255603Z",
"start_time": "2018-05-08T18:08:28.497941Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/thiago/anaconda3/envs/uniritter/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.\n",
" if diff:\n",
"/Users/thiago/anaconda3/envs/uniritter/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.\n",
" if diff:\n"
]
},
{
"data": {
"text/plain": [
"(0.999455930359086, 0.9395995928062436)"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xgb = XGBClassifier()\n",
"xgb.fit(X_train, y_train[0])\n",
"y_train_pred = xgb.predict(X_train)\n",
"y_test_pred = xgb.predict(X_test)\n",
"accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"93.95% é um bom avanço em relação aos 92.6% do Random Forest. Vamos testar aumentando o número de estimadores:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:21:04.057371Z",
"start_time": "2018-05-08T18:11:24.262036Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/thiago/anaconda3/envs/uniritter/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.\n",
" if diff:\n",
"/Users/thiago/anaconda3/envs/uniritter/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.\n",
" if diff:\n"
]
},
{
"data": {
"text/plain": [
"(1.0, 0.9487614523243977)"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xgb = XGBClassifier(n_jobs=-1, random_state=1, early_stopping_rounds=5, n_estimators=500)\n",
"xgb.fit(X_train, y_train[0])\n",
"y_train_pred = xgb.predict(X_train)\n",
"y_test_pred = xgb.predict(X_test)\n",
"accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quase 94.87% é um bom avanço!"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:21:05.064900Z",
"start_time": "2018-05-08T18:21:04.062837Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TRAIN DATASET\n",
"0 Erros de (7352,) entradas\n",
"Acurácia: 1.0\n",
"Precisão:\n",
"[1. 1. 1. 1. 1. 1.]\n",
"Sensibilidade:\n",
"[1. 1. 1. 1. 1. 1.]\n",
"Confusion matrix, without normalization\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"TEST DATASET\n",
"151 Erros de (2947,) entradas\n",
"Acurácia: 0.9487614523243977\n",
"Precisão:\n",
"[0.93857965 0.94218415 0.98258706 0.94481236 0.89417989 1. ]\n",
"Sensibilidade:\n",
"[0.9858871 0.93418259 0.94047619 0.87169043 0.95300752 1. ]\n",
"Confusion matrix, without normalization\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"full_report(y_train[0], y_train_pred, y_test[0], y_test_pred)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:21:05.080566Z",
"start_time": "2018-05-08T18:21:05.067669Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"xgb_score = accuracy_score(y_test, y_test_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Testando o dataset com menos features:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:22:59.713602Z",
"start_time": "2018-05-08T18:21:05.084973Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/thiago/anaconda3/envs/uniritter/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.\n",
" if diff:\n",
"/Users/thiago/anaconda3/envs/uniritter/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.\n",
" if diff:\n"
]
},
{
"data": {
"text/plain": [
"(1.0, 0.9233118425517476)"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xgb = XGBClassifier(n_jobs=-1, random_state=1, early_stopping_rounds=5, n_estimators=500)\n",
"xgb.fit(X_train_fs, y_train[0])\n",
"y_train_pred = xgb.predict(X_train_fs)\n",
"y_test_pred = xgb.predict(X_test_fs)\n",
"accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, 2.5% de diferença."
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-19T21:45:40.675802Z",
"start_time": "2018-04-19T21:45:40.669718Z"
}
},
"source": [
"### 2.4 - Rede Neural"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Antes de explocar qualquer parâmetro vamos ver qual é a nossa baseline:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:23:03.981846Z",
"start_time": "2018-05-08T18:22:59.717046Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.9881664853101197, 0.9450288428910757)"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mlp = MLPClassifier(random_state=3)\n",
"mlp.fit(X_train, y_train[0])\n",
"mlp.score(X_train, y_train), mlp.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Em relação ao RF, parece que ele teve um overfitting menor para o conjunto de treino mas obteve um resultado muito superior no conjunto de validação.\n",
"\n",
"Vamos começar explorando a quantidade de camadas e neurônios:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:23:03.992403Z",
"start_time": "2018-05-08T18:23:03.985502Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"features = 561"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:23:38.612558Z",
"start_time": "2018-05-08T18:23:03.996187Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.9767410228509249, 0.9443501866304717)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mlp = MLPClassifier(hidden_layer_sizes=(2*features, int(features/2)), random_state=3)\n",
"mlp.fit(X_train, y_train[0])\n",
"mlp.score(X_train, y_train), mlp.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Com 2 camadas ele não mudou muito o resultado, vamos ver com mais camadas:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:26:50.795540Z",
"start_time": "2018-05-08T18:23:38.617771Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.9854461371055495, 0.9365456396335257)"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mlp = MLPClassifier(hidden_layer_sizes=(2*features, 3*features, 2*features, int(features/2)), random_state=3)\n",
"mlp.fit(X_train, y_train[0])\n",
"mlp.score(X_train, y_train), mlp.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Essa arquitetura com 4 camadas internas não ajudou muito.\n",
"\n",
"Vamos ver o que acontece com uma quantidade exata de features."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:27:11.105467Z",
"start_time": "2018-05-08T18:26:50.804689Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.9865342763873776, 0.9409569053274517)"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mlp = MLPClassifier(hidden_layer_sizes=(features), random_state=3)\n",
"mlp.fit(X_train, y_train[0])\n",
"mlp.score(X_train, y_train), mlp.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-19T22:00:02.188418Z",
"start_time": "2018-04-19T22:00:02.179360Z"
}
},
"source": [
"Ainda não. E o que acontece se diminuirmos a quantidade para mennos de 100:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:27:15.170830Z",
"start_time": "2018-05-08T18:27:11.109091Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.9895266594124048, 0.9504580929759077)"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mlp = MLPClassifier(hidden_layer_sizes=(50,), random_state=3)\n",
"mlp.fit(X_train, y_train[0])\n",
"mlp.score(X_train, y_train), mlp.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-19T22:02:37.239304Z",
"start_time": "2018-04-19T22:02:37.230272Z"
}
},
"source": [
"O resultado parece ser um pouco melhor do que com 100 neurônios."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:27:15.892738Z",
"start_time": "2018-05-08T18:27:15.174430Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TRAIN DATASET\n",
"77 Erros de (7352,) entradas\n",
"Acurácia: 0.9895266594124048\n",
"Precisão:\n",
"[1. 1. 1. 0.96250956 0.97930525 1. ]\n",
"Sensibilidade:\n",
"[1. 1. 1. 0.97822706 0.9643377 1. ]\n",
"Confusion matrix, without normalization\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"TEST DATASET\n",
"146 Erros de (2947,) entradas\n",
"Acurácia: 0.9504580929759077\n",
"Precisão:\n",
"[0.95472441 0.95503212 0.96626506 0.94193548 0.89222615 1. ]\n",
"Sensibilidade:\n",
"[0.97782258 0.94692144 0.9547619 0.89205703 0.94924812 0.97951583]\n",
"Confusion matrix, without normalization\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"y_train_pred = mlp.predict(X_train)\n",
"y_test_pred = mlp.predict(X_test)\n",
"full_report(y_train[0], y_train_pred, y_test[0], y_test_pred)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:27:15.917803Z",
"start_time": "2018-05-08T18:27:15.895802Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"mlp_score = mlp.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Testando com o dataset com menos features (feature selection):"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:27:19.215678Z",
"start_time": "2018-05-08T18:27:15.921233Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.9658596300326442, 0.9273837801153716)"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mlp = MLPClassifier(hidden_layer_sizes=(50,), random_state=3)\n",
"mlp.fit(X_train_fs, y_train[0])\n",
"mlp.score(X_train_fs, y_train), mlp.score(X_test_fs, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Novamente inferior ao valor com o full dataset, mas valeu o experimento."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3 - Comparando os resultados"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"ExecuteTime": {
"end_time": "2018-05-08T18:27:19.413926Z",
"start_time": "2018-05-08T18:27:19.219321Z"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.set_style(\"whitegrid\")\n",
"ax = sns.barplot(x=\"Algoritimo\", y=\"Acurácia\", data=pd.DataFrame(data={\"Algoritimo\": [\"Naive Bayes\", \"Random Forest\", \"XGBoost\", \"MLP\"], \"Acurácia\": [nb_score, rf_score, xgb_score, mlp_score]}))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A comparação foi feita apenas com o dataset de validação. Todos os algoritmos tiveram um bom resultado (>0.8), E apesar do Random Forest ter conseguido uma acurácia de 100% no dataset de treino, o Multi Layer Perceptron conseguiu se destacar como o dataset de validação. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:uniritter]",
"language": "python",
"name": "conda-env-uniritter-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}