{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model: _Principal Component Analysis for Dimensionality Reduction_\n",
"This practical use case presents a feature extraction (dimensionality reduction) mechanism based on the PCA algorithm."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Author: José Alamo Palomino"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Case Study: Android Malware Detection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sophisticated, advanced Android malware can detect the emulator used by the malware analyst and, in response, alter its behavior to evade detection. To overcome this problem, we install the Android applications on a real device and capture their network traffic. See our publicly available Android Sandbox.\n",
"\n",
"The CICAAGM dataset is captured by installing Android applications on real smartphones in a semi-automated way. The dataset is generated from 1900 applications in the following three categories:\n",
"\n",
"### 1. Adware (250 applications)\n",
"\n",
"* **Airpush:** designed to deliver unsolicited advertisements to the user's system in order to steal information.\n",
"\n",
"* **Dowgin:** designed as an advertising library that can also steal the user's information.\n",
"\n",
"* **Kemoge:** designed to take over a user's Android device. This adware is a botnet hybrid that disguises itself as popular applications through repackaging.\n",
"\n",
"* **Mobidash:** designed to display advertisements and compromise the user's personal information.\n",
"\n",
"* **Shuanet:** similar to Kemoge, Shuanet is also designed to take over a user's device.\n",
"\n",
"### 2. General malware (150 applications)\n",
"\n",
"* **AVpass:** designed to be distributed under the guise of a clock application.\n",
"\n",
"* **FakeAV:** designed as a scam that tricks the user into buying a full version of the software in order to remediate non-existent infections.\n",
"\n",
"* **FakeFlash / FakePlayer:** designed as a fake Flash application that directs users to a website (after a successful installation).\n",
"\n",
"* **GGtracker:** designed for SMS fraud (it sends SMS messages to a premium-rate number) and information theft.\n",
"\n",
"* **Penetho:** designed as a fake service (a hacktool for Android devices that can be used to crack WiFi passwords). The malware can also infect the user's computer through infected email attachments, fake updates, external media, and infected documents.\n",
"\n",
"### 3. Benign (1500 applications)\n",
"\n",
"* 2015 GooglePlay market (top free popular and top free new)\n",
"* 2016 GooglePlay market (top free popular and top free new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Note:\n",
"The dataset was built by installing applications on Android devices and then capturing the network traffic that those devices generate.\n",
"\n",
"The goal is to train a Random Forest algorithm that can distinguish network traffic flows belonging to the classes `benign`, `asware` (the dataset's label for adware), and `GeneralMalware`, so that when new network traffic arrives, the model returns one of these three categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">1. Importing the required libraries</h2>"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import f1_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">2. Helper functions</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 Function for splitting the dataset"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):\n",
"    \"\"\"Split df into 60% train, 20% validation, and 20% test sets.\"\"\"\n",
"    strat = df[stratify] if stratify else None\n",
"    # First split: 60% train, 40% held out.\n",
"    train_set, test_set = train_test_split(\n",
"        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)\n",
"    strat = test_set[stratify] if stratify else None\n",
"    # Second split: halve the held-out 40% into validation and test.\n",
"    val_set, test_set = train_test_split(\n",
"        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)\n",
"    return (train_set, val_set, test_set)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Function for separating the input features from the output labels"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"def remove_labels(df, label_name):\n",
"    \"\"\"Return the feature matrix X and a copy of the label column y.\"\"\"\n",
"    X = df.drop(label_name, axis=1)\n",
"    y = df[label_name].copy()\n",
"    return (X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">3. Reading the dataset</h2>"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration</th>\n",
" <th>total_fpackets</th>\n",
" <th>total_bpackets</th>\n",
" <th>total_fpktl</th>\n",
" <th>total_bpktl</th>\n",
" <th>min_fpktl</th>\n",
" <th>min_bpktl</th>\n",
" <th>max_fpktl</th>\n",
" <th>max_bpktl</th>\n",
" <th>mean_fpktl</th>\n",
" <th>...</th>\n",
" <th>mean_idle</th>\n",
" <th>max_idle</th>\n",
" <th>std_idle</th>\n",
" <th>FFNEPD</th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>RRT_samples_clnt</th>\n",
" <th>Act_data_pkt_forward</th>\n",
" <th>min_seg_size_forward</th>\n",
" <th>calss</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1020586</td>\n",
" <td>668</td>\n",
" <td>1641</td>\n",
" <td>35692</td>\n",
" <td>2276876</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>679</td>\n",
" <td>1390</td>\n",
" <td>53.431138</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>1853440</td>\n",
" <td>1640</td>\n",
" <td>668</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80794</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>998</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>187</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>83</td>\n",
" <td>-1</td>\n",
" <td>62.333333</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>4</td>\n",
" <td>101888</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>189868</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>1448</td>\n",
" <td>6200</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>706</td>\n",
" <td>1390</td>\n",
" <td>160.888889</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>2722560</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>110577</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>528</td>\n",
" <td>1422</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>331</td>\n",
" <td>1005</td>\n",
" <td>132.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>155136</td>\n",
" <td>31232</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631950</th>\n",
" <td>530</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631951</th>\n",
" <td>50240627</td>\n",
" <td>23</td>\n",
" <td>24</td>\n",
" <td>4767</td>\n",
" <td>6107</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>533</td>\n",
" <td>855</td>\n",
" <td>207.260870</td>\n",
" <td>...</td>\n",
" <td>9842879.0</td>\n",
" <td>9964749</td>\n",
" <td>1.196806e+05</td>\n",
" <td>2</td>\n",
" <td>317952</td>\n",
" <td>107008</td>\n",
" <td>11</td>\n",
" <td>23</td>\n",
" <td>32</td>\n",
" <td>GeneralMalware</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631952</th>\n",
" <td>35471450</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>52</td>\n",
" <td>104</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>35300000.0</td>\n",
" <td>35290631</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>3904</td>\n",
" <td>88704</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>32</td>\n",
" <td>asware</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631953</th>\n",
" <td>41713629</td>\n",
" <td>12</td>\n",
" <td>26</td>\n",
" <td>1821</td>\n",
" <td>18643</td>\n",
" <td>40</td>\n",
" <td>40</td>\n",
" <td>489</td>\n",
" <td>1390</td>\n",
" <td>151.750000</td>\n",
" <td>...</td>\n",
" <td>20200000.0</td>\n",
" <td>32711382</td>\n",
" <td>1.770000e+07</td>\n",
" <td>2</td>\n",
" <td>227456</td>\n",
" <td>2432</td>\n",
" <td>23</td>\n",
" <td>12</td>\n",
" <td>20</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631954</th>\n",
" <td>50110119</td>\n",
" <td>20</td>\n",
" <td>23</td>\n",
" <td>4130</td>\n",
" <td>6043</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>533</td>\n",
" <td>855</td>\n",
" <td>206.500000</td>\n",
" <td>...</td>\n",
" <td>9873329.4</td>\n",
" <td>9906007</td>\n",
" <td>4.737363e+04</td>\n",
" <td>2</td>\n",
" <td>266112</td>\n",
" <td>59904</td>\n",
" <td>11</td>\n",
" <td>20</td>\n",
" <td>32</td>\n",
" <td>GeneralMalware</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>631955 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" duration total_fpackets total_bpackets total_fpktl total_bpktl \\\n",
"0 1020586 668 1641 35692 2276876 \n",
"1 80794 1 1 75 124 \n",
"2 998 3 0 187 0 \n",
"3 189868 9 9 1448 6200 \n",
"4 110577 4 6 528 1422 \n",
"... ... ... ... ... ... \n",
"631950 530 1 1 74 334 \n",
"631951 50240627 23 24 4767 6107 \n",
"631952 35471450 1 2 52 104 \n",
"631953 41713629 12 26 1821 18643 \n",
"631954 50110119 20 23 4130 6043 \n",
"\n",
" min_fpktl min_bpktl max_fpktl max_bpktl mean_fpktl ... \\\n",
"0 52 52 679 1390 53.431138 ... \n",
"1 75 124 75 124 75.000000 ... \n",
"2 52 -1 83 -1 62.333333 ... \n",
"3 52 52 706 1390 160.888889 ... \n",
"4 52 52 331 1005 132.000000 ... \n",
"... ... ... ... ... ... ... \n",
"631950 74 334 74 334 74.000000 ... \n",
"631951 52 52 533 855 207.260870 ... \n",
"631952 52 52 52 52 52.000000 ... \n",
"631953 40 40 489 1390 151.750000 ... \n",
"631954 52 52 533 855 206.500000 ... \n",
"\n",
" mean_idle max_idle std_idle FFNEPD Init_Win_bytes_forward \\\n",
"0 0.0 -1 0.000000e+00 2 4194240 \n",
"1 0.0 -1 0.000000e+00 2 0 \n",
"2 0.0 -1 0.000000e+00 4 101888 \n",
"3 0.0 -1 0.000000e+00 2 4194240 \n",
"4 0.0 -1 0.000000e+00 2 155136 \n",
"... ... ... ... ... ... \n",
"631950 0.0 -1 0.000000e+00 2 0 \n",
"631951 9842879.0 9964749 1.196806e+05 2 317952 \n",
"631952 35300000.0 35290631 0.000000e+00 2 3904 \n",
"631953 20200000.0 32711382 1.770000e+07 2 227456 \n",
"631954 9873329.4 9906007 4.737363e+04 2 266112 \n",
"\n",
" Init_Win_bytes_backward RRT_samples_clnt Act_data_pkt_forward \\\n",
"0 1853440 1640 668 \n",
"1 0 0 1 \n",
"2 -1 0 3 \n",
"3 2722560 8 9 \n",
"4 31232 5 4 \n",
"... ... ... ... \n",
"631950 0 0 1 \n",
"631951 107008 11 23 \n",
"631952 88704 1 1 \n",
"631953 2432 23 12 \n",
"631954 59904 11 20 \n",
"\n",
" min_seg_size_forward calss \n",
"0 32 benign \n",
"1 0 benign \n",
"2 32 benign \n",
"3 32 benign \n",
"4 32 benign \n",
"... ... ... \n",
"631950 0 benign \n",
"631951 32 GeneralMalware \n",
"631952 32 asware \n",
"631953 20 benign \n",
"631954 32 GeneralMalware \n",
"\n",
"[631955 rows x 80 columns]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('datasets/TotalFeatures-ISCXFlowMeter.csv')\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">4. Visualizing the dataset</h2>"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration</th>\n",
" <th>total_fpackets</th>\n",
" <th>total_bpackets</th>\n",
" <th>total_fpktl</th>\n",
" <th>total_bpktl</th>\n",
" <th>min_fpktl</th>\n",
" <th>min_bpktl</th>\n",
" <th>max_fpktl</th>\n",
" <th>max_bpktl</th>\n",
" <th>mean_fpktl</th>\n",
" <th>...</th>\n",
" <th>mean_idle</th>\n",
" <th>max_idle</th>\n",
" <th>std_idle</th>\n",
" <th>FFNEPD</th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>RRT_samples_clnt</th>\n",
" <th>Act_data_pkt_forward</th>\n",
" <th>min_seg_size_forward</th>\n",
" <th>calss</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1020586</td>\n",
" <td>668</td>\n",
" <td>1641</td>\n",
" <td>35692</td>\n",
" <td>2276876</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>679</td>\n",
" <td>1390</td>\n",
" <td>53.431138</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>1853440</td>\n",
" <td>1640</td>\n",
" <td>668</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80794</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>998</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>187</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>83</td>\n",
" <td>-1</td>\n",
" <td>62.333333</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>101888</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>189868</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>1448</td>\n",
" <td>6200</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>706</td>\n",
" <td>1390</td>\n",
" <td>160.888889</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>2722560</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>110577</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>528</td>\n",
" <td>1422</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>331</td>\n",
" <td>1005</td>\n",
" <td>132.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>155136</td>\n",
" <td>31232</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>261876</td>\n",
" <td>7</td>\n",
" <td>6</td>\n",
" <td>1618</td>\n",
" <td>882</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>730</td>\n",
" <td>477</td>\n",
" <td>231.142857</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>926720</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>14</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>104</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>5824</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>29675</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>71</td>\n",
" <td>213</td>\n",
" <td>71</td>\n",
" <td>213</td>\n",
" <td>71</td>\n",
" <td>213</td>\n",
" <td>71.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>806635</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>239</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>83</td>\n",
" <td>-1</td>\n",
" <td>59.750000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>107008</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>56620</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1074</td>\n",
" <td>719</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>592</td>\n",
" <td>667</td>\n",
" <td>358.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>128512</td>\n",
" <td>10816</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" duration total_fpackets total_bpackets total_fpktl total_bpktl \\\n",
"0 1020586 668 1641 35692 2276876 \n",
"1 80794 1 1 75 124 \n",
"2 998 3 0 187 0 \n",
"3 189868 9 9 1448 6200 \n",
"4 110577 4 6 528 1422 \n",
"5 261876 7 6 1618 882 \n",
"6 14 2 0 104 0 \n",
"7 29675 1 1 71 213 \n",
"8 806635 4 0 239 0 \n",
"9 56620 3 2 1074 719 \n",
"\n",
" min_fpktl min_bpktl max_fpktl max_bpktl mean_fpktl ... mean_idle \\\n",
"0 52 52 679 1390 53.431138 ... 0.0 \n",
"1 75 124 75 124 75.000000 ... 0.0 \n",
"2 52 -1 83 -1 62.333333 ... 0.0 \n",
"3 52 52 706 1390 160.888889 ... 0.0 \n",
"4 52 52 331 1005 132.000000 ... 0.0 \n",
"5 52 52 730 477 231.142857 ... 0.0 \n",
"6 52 -1 52 -1 52.000000 ... 0.0 \n",
"7 71 213 71 213 71.000000 ... 0.0 \n",
"8 52 -1 83 -1 59.750000 ... 0.0 \n",
"9 52 52 592 667 358.000000 ... 0.0 \n",
"\n",
" max_idle std_idle FFNEPD Init_Win_bytes_forward \\\n",
"0 -1 0.0 2 4194240 \n",
"1 -1 0.0 2 0 \n",
"2 -1 0.0 4 101888 \n",
"3 -1 0.0 2 4194240 \n",
"4 -1 0.0 2 155136 \n",
"5 -1 0.0 2 4194240 \n",
"6 -1 0.0 3 5824 \n",
"7 -1 0.0 2 0 \n",
"8 -1 0.0 5 107008 \n",
"9 -1 0.0 3 128512 \n",
"\n",
" Init_Win_bytes_backward RRT_samples_clnt Act_data_pkt_forward \\\n",
"0 1853440 1640 668 \n",
"1 0 0 1 \n",
"2 -1 0 3 \n",
"3 2722560 8 9 \n",
"4 31232 5 4 \n",
"5 926720 3 7 \n",
"6 -1 0 2 \n",
"7 0 0 1 \n",
"8 -1 0 4 \n",
"9 10816 1 3 \n",
"\n",
" min_seg_size_forward calss \n",
"0 32 benign \n",
"1 0 benign \n",
"2 32 benign \n",
"3 32 benign \n",
"4 32 benign \n",
"5 32 benign \n",
"6 32 benign \n",
"7 0 benign \n",
"8 32 benign \n",
"9 32 benign \n",
"\n",
"[10 rows x 80 columns]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration</th>\n",
" <th>total_fpackets</th>\n",
" <th>total_bpackets</th>\n",
" <th>total_fpktl</th>\n",
" <th>total_bpktl</th>\n",
" <th>min_fpktl</th>\n",
" <th>min_bpktl</th>\n",
" <th>max_fpktl</th>\n",
" <th>max_bpktl</th>\n",
" <th>mean_fpktl</th>\n",
" <th>...</th>\n",
" <th>min_idle</th>\n",
" <th>mean_idle</th>\n",
" <th>max_idle</th>\n",
" <th>std_idle</th>\n",
" <th>FFNEPD</th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>RRT_samples_clnt</th>\n",
" <th>Act_data_pkt_forward</th>\n",
" <th>min_seg_size_forward</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>...</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.00000</td>\n",
" <td>631955.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>2.195245e+07</td>\n",
" <td>6.728514</td>\n",
" <td>10.431934</td>\n",
" <td>9.540172e+02</td>\n",
" <td>1.206042e+04</td>\n",
" <td>141.475727</td>\n",
" <td>44.357688</td>\n",
" <td>263.675901</td>\n",
" <td>183.248084</td>\n",
" <td>174.959706</td>\n",
" <td>...</td>\n",
" <td>1.997327e+07</td>\n",
" <td>2.031228e+07</td>\n",
" <td>2.075238e+07</td>\n",
" <td>4.663875e+05</td>\n",
" <td>2.360896</td>\n",
" <td>9.620796e+05</td>\n",
" <td>3.104519e+05</td>\n",
" <td>9.733144</td>\n",
" <td>6.72471</td>\n",
" <td>19.965713</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>1.900578e+08</td>\n",
" <td>174.161354</td>\n",
" <td>349.424019</td>\n",
" <td>8.235040e+04</td>\n",
" <td>4.824716e+05</td>\n",
" <td>157.680880</td>\n",
" <td>89.099554</td>\n",
" <td>289.644383</td>\n",
" <td>371.863224</td>\n",
" <td>162.024811</td>\n",
" <td>...</td>\n",
" <td>1.897986e+08</td>\n",
" <td>1.897902e+08</td>\n",
" <td>1.899721e+08</td>\n",
" <td>6.199704e+06</td>\n",
" <td>3.041810</td>\n",
" <td>1.705655e+06</td>\n",
" <td>6.647956e+05</td>\n",
" <td>347.877923</td>\n",
" <td>174.13813</td>\n",
" <td>14.914261</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-1.800000e+01</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.000000e+00</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>6.900000e+01</td>\n",
" <td>0.000000e+00</td>\n",
" <td>52.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>52.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>1.00000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>2.445000e+04</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.840000e+02</td>\n",
" <td>0.000000e+00</td>\n",
" <td>52.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>83.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>83.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>8.761600e+04</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>1.00000</td>\n",
" <td>32.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1.759751e+06</td>\n",
" <td>3.000000</td>\n",
" <td>1.000000</td>\n",
" <td>4.270000e+02</td>\n",
" <td>1.670000e+02</td>\n",
" <td>108.000000</td>\n",
" <td>52.000000</td>\n",
" <td>421.000000</td>\n",
" <td>115.000000</td>\n",
" <td>356.000000</td>\n",
" <td>...</td>\n",
" <td>1.013498e+06</td>\n",
" <td>1.291379e+06</td>\n",
" <td>1.306116e+06</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>3.046400e+05</td>\n",
" <td>9.049600e+04</td>\n",
" <td>1.000000</td>\n",
" <td>3.00000</td>\n",
" <td>32.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>4.431076e+10</td>\n",
" <td>48255.000000</td>\n",
" <td>74768.000000</td>\n",
" <td>4.049644e+07</td>\n",
" <td>1.039222e+08</td>\n",
" <td>1390.000000</td>\n",
" <td>1390.000000</td>\n",
" <td>1500.000000</td>\n",
" <td>1390.000000</td>\n",
" <td>1390.000000</td>\n",
" <td>...</td>\n",
" <td>4.431072e+10</td>\n",
" <td>4.430000e+10</td>\n",
" <td>4.431072e+10</td>\n",
" <td>8.470000e+08</td>\n",
" <td>2269.000000</td>\n",
" <td>4.194240e+06</td>\n",
" <td>4.194240e+06</td>\n",
" <td>74524.000000</td>\n",
" <td>48255.00000</td>\n",
" <td>44.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 79 columns</p>\n",
"</div>"
],
"text/plain": [
" duration total_fpackets total_bpackets total_fpktl \\\n",
"count 6.319550e+05 631955.000000 631955.000000 6.319550e+05 \n",
"mean 2.195245e+07 6.728514 10.431934 9.540172e+02 \n",
"std 1.900578e+08 174.161354 349.424019 8.235040e+04 \n",
"min -1.800000e+01 0.000000 0.000000 0.000000e+00 \n",
"25% 0.000000e+00 1.000000 0.000000 6.900000e+01 \n",
"50% 2.445000e+04 1.000000 0.000000 1.840000e+02 \n",
"75% 1.759751e+06 3.000000 1.000000 4.270000e+02 \n",
"max 4.431076e+10 48255.000000 74768.000000 4.049644e+07 \n",
"\n",
" total_bpktl min_fpktl min_bpktl max_fpktl \\\n",
"count 6.319550e+05 631955.000000 631955.000000 631955.000000 \n",
"mean 1.206042e+04 141.475727 44.357688 263.675901 \n",
"std 4.824716e+05 157.680880 89.099554 289.644383 \n",
"min 0.000000e+00 -1.000000 -1.000000 -1.000000 \n",
"25% 0.000000e+00 52.000000 -1.000000 52.000000 \n",
"50% 0.000000e+00 52.000000 -1.000000 83.000000 \n",
"75% 1.670000e+02 108.000000 52.000000 421.000000 \n",
"max 1.039222e+08 1390.000000 1390.000000 1500.000000 \n",
"\n",
" max_bpktl mean_fpktl ... min_idle mean_idle \\\n",
"count 631955.000000 631955.000000 ... 6.319550e+05 6.319550e+05 \n",
"mean 183.248084 174.959706 ... 1.997327e+07 2.031228e+07 \n",
"std 371.863224 162.024811 ... 1.897986e+08 1.897902e+08 \n",
"min -1.000000 0.000000 ... -1.000000e+00 0.000000e+00 \n",
"25% -1.000000 52.000000 ... -1.000000e+00 0.000000e+00 \n",
"50% -1.000000 83.000000 ... -1.000000e+00 0.000000e+00 \n",
"75% 115.000000 356.000000 ... 1.013498e+06 1.291379e+06 \n",
"max 1390.000000 1390.000000 ... 4.431072e+10 4.430000e+10 \n",
"\n",
" max_idle std_idle FFNEPD Init_Win_bytes_forward \\\n",
"count 6.319550e+05 6.319550e+05 631955.000000 6.319550e+05 \n",
"mean 2.075238e+07 4.663875e+05 2.360896 9.620796e+05 \n",
"std 1.899721e+08 6.199704e+06 3.041810 1.705655e+06 \n",
"min -1.000000e+00 0.000000e+00 2.000000 -1.000000e+00 \n",
"25% -1.000000e+00 0.000000e+00 2.000000 0.000000e+00 \n",
"50% -1.000000e+00 0.000000e+00 2.000000 8.761600e+04 \n",
"75% 1.306116e+06 0.000000e+00 2.000000 3.046400e+05 \n",
"max 4.431072e+10 8.470000e+08 2269.000000 4.194240e+06 \n",
"\n",
" Init_Win_bytes_backward RRT_samples_clnt Act_data_pkt_forward \\\n",
"count 6.319550e+05 631955.000000 631955.00000 \n",
"mean 3.104519e+05 9.733144 6.72471 \n",
"std 6.647956e+05 347.877923 174.13813 \n",
"min -1.000000e+00 0.000000 0.00000 \n",
"25% -1.000000e+00 0.000000 1.00000 \n",
"50% -1.000000e+00 0.000000 1.00000 \n",
"75% 9.049600e+04 1.000000 3.00000 \n",
"max 4.194240e+06 74524.000000 48255.00000 \n",
"\n",
" min_seg_size_forward \n",
"count 631955.000000 \n",
"mean 19.965713 \n",
"std 14.914261 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 32.000000 \n",
"75% 32.000000 \n",
"max 44.000000 \n",
"\n",
"[8 rows x 79 columns]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 631955 entries, 0 to 631954\n",
"Data columns (total 80 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 duration 631955 non-null int64 \n",
" 1 total_fpackets 631955 non-null int64 \n",
" 2 total_bpackets 631955 non-null int64 \n",
" 3 total_fpktl 631955 non-null int64 \n",
" 4 total_bpktl 631955 non-null int64 \n",
" 5 min_fpktl 631955 non-null int64 \n",
" 6 min_bpktl 631955 non-null int64 \n",
" 7 max_fpktl 631955 non-null int64 \n",
" 8 max_bpktl 631955 non-null int64 \n",
" 9 mean_fpktl 631955 non-null float64\n",
" 10 mean_bpktl 631955 non-null float64\n",
" 11 std_fpktl 631955 non-null float64\n",
" 12 std_bpktl 631955 non-null float64\n",
" 13 total_fiat 631955 non-null int64 \n",
" 14 total_biat 631955 non-null int64 \n",
" 15 min_fiat 631955 non-null int64 \n",
" 16 min_biat 631955 non-null int64 \n",
" 17 max_fiat 631955 non-null int64 \n",
" 18 max_biat 631955 non-null int64 \n",
" 19 mean_fiat 631955 non-null float64\n",
" 20 mean_biat 631955 non-null float64\n",
" 21 std_fiat 631955 non-null float64\n",
" 22 std_biat 631955 non-null float64\n",
" 23 fpsh_cnt 631955 non-null int64 \n",
" 24 bpsh_cnt 631955 non-null int64 \n",
" 25 furg_cnt 631955 non-null int64 \n",
" 26 burg_cnt 631955 non-null int64 \n",
" 27 total_fhlen 631955 non-null int64 \n",
" 28 total_bhlen 631955 non-null int64 \n",
" 29 fPktsPerSecond 631955 non-null float64\n",
" 30 bPktsPerSecond 631955 non-null float64\n",
" 31 flowPktsPerSecond 631955 non-null float64\n",
" 32 flowBytesPerSecond 631955 non-null float64\n",
" 33 min_flowpktl 631955 non-null int64 \n",
" 34 max_flowpktl 631955 non-null int64 \n",
" 35 mean_flowpktl 631955 non-null float64\n",
" 36 std_flowpktl 631955 non-null float64\n",
" 37 min_flowiat 631955 non-null int64 \n",
" 38 max_flowiat 631955 non-null int64 \n",
" 39 mean_flowiat 631955 non-null float64\n",
" 40 std_flowiat 631955 non-null float64\n",
" 41 flow_fin 631955 non-null int64 \n",
" 42 flow_syn 631955 non-null int64 \n",
" 43 flow_rst 631955 non-null int64 \n",
" 44 flow_psh 631955 non-null int64 \n",
" 45 flow_ack 631955 non-null int64 \n",
" 46 flow_urg 631955 non-null int64 \n",
" 47 flow_cwr 631955 non-null int64 \n",
" 48 flow_ece 631955 non-null int64 \n",
" 49 downUpRatio 631955 non-null float64\n",
" 50 avgPacketSize 631955 non-null float64\n",
" 51 fAvgSegmentSize 631955 non-null float64\n",
" 52 fHeaderBytes 631955 non-null int64 \n",
" 53 fAvgBytesPerBulk 631955 non-null int64 \n",
" 54 fAvgPacketsPerBulk 631955 non-null int64 \n",
" 55 fAvgBulkRate 631955 non-null int64 \n",
" 56 bVarianceDataBytes 631955 non-null float64\n",
" 57 bAvgSegmentSize 631955 non-null int64 \n",
" 58 bAvgBytesPerBulk 631955 non-null int64 \n",
" 59 bAvgPacketsPerBulk 631955 non-null int64 \n",
" 60 bAvgBulkRate 631955 non-null int64 \n",
" 61 sflow_fpacket 631955 non-null int64 \n",
" 62 sflow_fbytes 631955 non-null int64 \n",
" 63 sflow_bpacket 631955 non-null int64 \n",
" 64 sflow_bbytes 631955 non-null int64 \n",
" 65 min_active 631955 non-null int64 \n",
" 66 mean_active 631955 non-null float64\n",
" 67 max_active 631955 non-null int64 \n",
" 68 std_active 631955 non-null float64\n",
" 69 min_idle 631955 non-null int64 \n",
" 70 mean_idle 631955 non-null float64\n",
" 71 max_idle 631955 non-null int64 \n",
" 72 std_idle 631955 non-null float64\n",
" 73 FFNEPD 631955 non-null int64 \n",
" 74 Init_Win_bytes_forward 631955 non-null int64 \n",
" 75 Init_Win_bytes_backward 631955 non-null int64 \n",
" 76 RRT_samples_clnt 631955 non-null int64 \n",
" 77 Act_data_pkt_forward 631955 non-null int64 \n",
" 78 min_seg_size_forward 631955 non-null int64 \n",
" 79 calss 631955 non-null object \n",
"dtypes: float64(24), int64(55), object(1)\n",
"memory usage: 385.7+ MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"benign 471597\n",
"asware 155613\n",
"GeneralMalware 4745\n",
"Name: calss, dtype: int64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['calss'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">5. Feature extraction: PCA</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PCA is a feature extraction algorithm. The key difference from feature selection is that we **reduce the original dataset and transform it into a new one**: instead of keeping a subset of the original columns, we build new ones.\n",
"\n",
"Dimensionality goes down, and the values change as well, because each new feature is a linear combination of the original features."
]
},
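{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, on synthetic data rather than the CICAAGM dataset, of what this transformation means: every column produced by PCA is a weighted sum of the centered original features. The names `X_toy`, `pca_toy` and `manual`, as well as the array shape, are assumptions introduced only for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"\n",
"# Synthetic data, for illustration only: 200 samples, 5 features\n",
"rng = np.random.default_rng(42)\n",
"X_toy = rng.normal(size=(200, 5))\n",
"\n",
"pca_toy = PCA(n_components=2)\n",
"X_toy_reduced = pca_toy.fit_transform(X_toy)\n",
"\n",
"# Recompute the projection by hand: center the data, then take weighted sums\n",
"# of the original features using the rows of components_ as weights\n",
"manual = (X_toy - pca_toy.mean_) @ pca_toy.components_.T\n",
"\n",
"print(X_toy_reduced.shape)\n",
"print(np.allclose(X_toy_reduced, manual))"
]
},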
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# Separate the input features (X) from the label (y)\n",
"X_df, y_df = remove_labels(df, 'calss')"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration</th>\n",
" <th>total_fpackets</th>\n",
" <th>total_bpackets</th>\n",
" <th>total_fpktl</th>\n",
" <th>total_bpktl</th>\n",
" <th>min_fpktl</th>\n",
" <th>min_bpktl</th>\n",
" <th>max_fpktl</th>\n",
" <th>max_bpktl</th>\n",
" <th>mean_fpktl</th>\n",
" <th>...</th>\n",
" <th>min_idle</th>\n",
" <th>mean_idle</th>\n",
" <th>max_idle</th>\n",
" <th>std_idle</th>\n",
" <th>FFNEPD</th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>RRT_samples_clnt</th>\n",
" <th>Act_data_pkt_forward</th>\n",
" <th>min_seg_size_forward</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1020586</td>\n",
" <td>668</td>\n",
" <td>1641</td>\n",
" <td>35692</td>\n",
" <td>2276876</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>679</td>\n",
" <td>1390</td>\n",
" <td>53.431138</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>1853440</td>\n",
" <td>1640</td>\n",
" <td>668</td>\n",
" <td>32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80794</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75.000000</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>998</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>187</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>83</td>\n",
" <td>-1</td>\n",
" <td>62.333333</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>4</td>\n",
" <td>101888</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>189868</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>1448</td>\n",
" <td>6200</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>706</td>\n",
" <td>1390</td>\n",
" <td>160.888889</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>2722560</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>110577</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>528</td>\n",
" <td>1422</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>331</td>\n",
" <td>1005</td>\n",
" <td>132.000000</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>155136</td>\n",
" <td>31232</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631950</th>\n",
" <td>530</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74.000000</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631951</th>\n",
" <td>50240627</td>\n",
" <td>23</td>\n",
" <td>24</td>\n",
" <td>4767</td>\n",
" <td>6107</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>533</td>\n",
" <td>855</td>\n",
" <td>207.260870</td>\n",
" <td>...</td>\n",
" <td>9655008</td>\n",
" <td>9842879.0</td>\n",
" <td>9964749</td>\n",
" <td>1.196806e+05</td>\n",
" <td>2</td>\n",
" <td>317952</td>\n",
" <td>107008</td>\n",
" <td>11</td>\n",
" <td>23</td>\n",
" <td>32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631952</th>\n",
" <td>35471450</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>52</td>\n",
" <td>104</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>35290631</td>\n",
" <td>35300000.0</td>\n",
" <td>35290631</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>3904</td>\n",
" <td>88704</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631953</th>\n",
" <td>41713629</td>\n",
" <td>12</td>\n",
" <td>26</td>\n",
" <td>1821</td>\n",
" <td>18643</td>\n",
" <td>40</td>\n",
" <td>40</td>\n",
" <td>489</td>\n",
" <td>1390</td>\n",
" <td>151.750000</td>\n",
" <td>...</td>\n",
" <td>7740379</td>\n",
" <td>20200000.0</td>\n",
" <td>32711382</td>\n",
" <td>1.770000e+07</td>\n",
" <td>2</td>\n",
" <td>227456</td>\n",
" <td>2432</td>\n",
" <td>23</td>\n",
" <td>12</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631954</th>\n",
" <td>50110119</td>\n",
" <td>20</td>\n",
" <td>23</td>\n",
" <td>4130</td>\n",
" <td>6043</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>533</td>\n",
" <td>855</td>\n",
" <td>206.500000</td>\n",
" <td>...</td>\n",
" <td>9792882</td>\n",
" <td>9873329.4</td>\n",
" <td>9906007</td>\n",
" <td>4.737363e+04</td>\n",
" <td>2</td>\n",
" <td>266112</td>\n",
" <td>59904</td>\n",
" <td>11</td>\n",
" <td>20</td>\n",
" <td>32</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>631955 rows × 79 columns</p>\n",
"</div>"
],
"text/plain": [
" duration total_fpackets total_bpackets total_fpktl total_bpktl \\\n",
"0 1020586 668 1641 35692 2276876 \n",
"1 80794 1 1 75 124 \n",
"2 998 3 0 187 0 \n",
"3 189868 9 9 1448 6200 \n",
"4 110577 4 6 528 1422 \n",
"... ... ... ... ... ... \n",
"631950 530 1 1 74 334 \n",
"631951 50240627 23 24 4767 6107 \n",
"631952 35471450 1 2 52 104 \n",
"631953 41713629 12 26 1821 18643 \n",
"631954 50110119 20 23 4130 6043 \n",
"\n",
" min_fpktl min_bpktl max_fpktl max_bpktl mean_fpktl ... min_idle \\\n",
"0 52 52 679 1390 53.431138 ... -1 \n",
"1 75 124 75 124 75.000000 ... -1 \n",
"2 52 -1 83 -1 62.333333 ... -1 \n",
"3 52 52 706 1390 160.888889 ... -1 \n",
"4 52 52 331 1005 132.000000 ... -1 \n",
"... ... ... ... ... ... ... ... \n",
"631950 74 334 74 334 74.000000 ... -1 \n",
"631951 52 52 533 855 207.260870 ... 9655008 \n",
"631952 52 52 52 52 52.000000 ... 35290631 \n",
"631953 40 40 489 1390 151.750000 ... 7740379 \n",
"631954 52 52 533 855 206.500000 ... 9792882 \n",
"\n",
" mean_idle max_idle std_idle FFNEPD Init_Win_bytes_forward \\\n",
"0 0.0 -1 0.000000e+00 2 4194240 \n",
"1 0.0 -1 0.000000e+00 2 0 \n",
"2 0.0 -1 0.000000e+00 4 101888 \n",
"3 0.0 -1 0.000000e+00 2 4194240 \n",
"4 0.0 -1 0.000000e+00 2 155136 \n",
"... ... ... ... ... ... \n",
"631950 0.0 -1 0.000000e+00 2 0 \n",
"631951 9842879.0 9964749 1.196806e+05 2 317952 \n",
"631952 35300000.0 35290631 0.000000e+00 2 3904 \n",
"631953 20200000.0 32711382 1.770000e+07 2 227456 \n",
"631954 9873329.4 9906007 4.737363e+04 2 266112 \n",
"\n",
" Init_Win_bytes_backward RRT_samples_clnt Act_data_pkt_forward \\\n",
"0 1853440 1640 668 \n",
"1 0 0 1 \n",
"2 -1 0 3 \n",
"3 2722560 8 9 \n",
"4 31232 5 4 \n",
"... ... ... ... \n",
"631950 0 0 1 \n",
"631951 107008 11 23 \n",
"631952 88704 1 1 \n",
"631953 2432 23 12 \n",
"631954 59904 11 20 \n",
"\n",
" min_seg_size_forward \n",
"0 32 \n",
"1 0 \n",
"2 32 \n",
"3 32 \n",
"4 32 \n",
"... ... \n",
"631950 0 \n",
"631951 32 \n",
"631952 32 \n",
"631953 20 \n",
"631954 32 \n",
"\n",
"[631955 rows x 79 columns]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Input features\n",
"X_df"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 benign\n",
"1 benign\n",
"2 benign\n",
"3 benign\n",
"4 benign\n",
" ... \n",
"631950 benign\n",
"631951 GeneralMalware\n",
"631952 asware\n",
"631953 benign\n",
"631954 GeneralMalware\n",
"Name: calss, Length: 631955, dtype: object"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Output labels\n",
"y_df"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 0, ..., 1, 0, 2], dtype=int64)"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Encode y as integers (codes are assigned in order of first appearance)\n",
"y_df = y_df.factorize()[0]\n",
"y_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Feature extraction is a useful way to visualize a multidimensional dataset and build intuition about the decision boundaries an algorithm learns. To do so, we can use PCA to reduce the number of features to 2.**\n",
"\n",
"Feature extraction techniques such as PCA transform a dataset with many features (dimensions) into a different dataset that preserves as much of the original variance as possible. With only 2 or 3 features, the result can be plotted."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"# Reduce the dataset to 2 dimensions with PCA\n",
"\n",
"# Instantiate PCA as \"pca\" with n_components=2, the number of components\n",
"# (dimensions) to keep.\n",
"pca = PCA(n_components=2)\n",
"\n",
"# Call the fit_transform method on X_df\n",
"df_reduced = pca.fit_transform(X_df)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>c1</th>\n",
" <th>c2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-6.653632e+07</td>\n",
" <td>-9.564604e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-6.704580e+07</td>\n",
" <td>-9.898031e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-6.712784e+07</td>\n",
" <td>-9.875840e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-6.699982e+07</td>\n",
" <td>-9.782837e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-6.702599e+07</td>\n",
" <td>-9.829385e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631950</th>\n",
" <td>-6.712883e+07</td>\n",
" <td>-9.876545e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631951</th>\n",
" <td>-1.490810e+07</td>\n",
" <td>1.783457e+07</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631952</th>\n",
" <td>2.098170e+07</td>\n",
" <td>6.844411e+07</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631953</th>\n",
" <td>1.163357e+07</td>\n",
" <td>3.237585e+07</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631954</th>\n",
" <td>-1.486282e+07</td>\n",
" <td>1.778576e+07</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>631955 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" c1 c2\n",
"0 -6.653632e+07 -9.564604e+06\n",
"1 -6.704580e+07 -9.898031e+06\n",
"2 -6.712784e+07 -9.875840e+06\n",
"3 -6.699982e+07 -9.782837e+06\n",
"4 -6.702599e+07 -9.829385e+06\n",
"... ... ...\n",
"631950 -6.712883e+07 -9.876545e+06\n",
"631951 -1.490810e+07 1.783457e+07\n",
"631952 2.098170e+07 6.844411e+07\n",
"631953 1.163357e+07 3.237585e+07\n",
"631954 -1.486282e+07 1.778576e+07\n",
"\n",
"[631955 rows x 2 columns]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Build the new 2-dimensional dataset\n",
"# We keep all 631955 rows, but the 79 input columns (features) have been transformed into\n",
"# just 2 columns.\n",
"df_reduced = pd.DataFrame(df_reduced, columns=[\"c1\", \"c2\"])\n",
"df_reduced"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>c1</th>\n",
" <th>c2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-6.653632e+07</td>\n",
" <td>-9.564604e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-6.704580e+07</td>\n",
" <td>-9.898031e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-6.712784e+07</td>\n",
" <td>-9.875840e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-6.699982e+07</td>\n",
" <td>-9.782837e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-6.702599e+07</td>\n",
" <td>-9.829385e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>-6.691581e+07</td>\n",
" <td>-9.791010e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>-6.713106e+07</td>\n",
" <td>-9.880533e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>-6.709856e+07</td>\n",
" <td>-9.884049e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>-6.610628e+07</td>\n",
" <td>-9.807246e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>-6.705952e+07</td>\n",
" <td>-9.848264e+06</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" c1 c2\n",
"0 -6.653632e+07 -9.564604e+06\n",
"1 -6.704580e+07 -9.898031e+06\n",
"2 -6.712784e+07 -9.875840e+06\n",
"3 -6.699982e+07 -9.782837e+06\n",
"4 -6.702599e+07 -9.829385e+06\n",
"5 -6.691581e+07 -9.791010e+06\n",
"6 -6.713106e+07 -9.880533e+06\n",
"7 -6.709856e+07 -9.884049e+06\n",
"8 -6.610628e+07 -9.807246e+06\n",
"9 -6.705952e+07 -9.848264e+06"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# First 10 rows\n",
"df_reduced.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We plot the new dataset with two input features and three classes (y)**"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 864x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"plt.figure(figsize=(12, 6))\n",
"plt.plot(df_reduced[\"c1\"][y_df==0], df_reduced[\"c2\"][y_df==0], \"yo\", label=\"normal\")\n",
"plt.plot(df_reduced[\"c1\"][y_df==1], df_reduced[\"c2\"][y_df==1], \"bs\", label=\"adware\")\n",
"plt.plot(df_reduced[\"c1\"][y_df==2], df_reduced[\"c2\"][y_df==2], \"g^\", label=\"malware\")\n",
"plt.xlabel(\"c1\", fontsize=15)\n",
"plt.ylabel(\"c2\", fontsize=15, rotation=0)\n",
"plt.legend()  # display the labels set in the plot calls above\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.91695209, 0.05610877])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Proportion of the original dataset's variance preserved by each component\n",
"pca.explained_variance_ratio_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result above shows that about 91.7% of the original dataset's variance lies along the first axis and 5.6% along the second. Roughly 2.7% of the variance therefore lies along the remaining axes, which were not used to build the new dataset, so it is reasonable to think that they contribute little additional information. Note that PCA was applied to unscaled features, so these ratios are dominated by the columns with the largest magnitudes."
]
},
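{
"cell_type": "markdown",
"metadata": {},
"source": [
"PCA is sensitive to feature scale: columns with very large values contribute most of the variance. The sketch below shows one way to standardize the features before projecting, purely for comparison. It assumes the `X_df` defined earlier; `pca_scaled` and `X_scaled_reduced` are names introduced here, and the rest of the notebook keeps using the unscaled projection."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Standardize every feature to zero mean and unit variance, then project to 2 components\n",
"pca_scaled = make_pipeline(StandardScaler(), PCA(n_components=2))\n",
"X_scaled_reduced = pca_scaled.fit_transform(X_df)\n",
"\n",
"# These ratios are relative to the standardized features, not the raw ones\n",
"print(pca_scaled.named_steps['pca'].explained_variance_ratio_)"
]
},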
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We plot the decision boundary that an algorithm would learn on this new, reduced dataset**"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n",
" max_depth=3, max_features=None, max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, presort='deprecated',\n",
" random_state=42, splitter='best')"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Train a model on the reduced dataset\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"clf_tree_reduced = DecisionTreeClassifier(max_depth=3, random_state=42)\n",
"clf_tree_reduced.fit(df_reduced, y_df)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 864x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Plot the decision boundary learned by the model\n",
"from matplotlib.colors import ListedColormap\n",
"\n",
"def plot_decision_boundary(clf, X, y, plot_training=True, resolution=1000):\n",
"    # Evaluate the classifier on a grid that covers the data range\n",
"    mins = X.min(axis=0) - 1\n",
"    maxs = X.max(axis=0) + 1\n",
"    x1, x2 = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),\n",
"                         np.linspace(mins[1], maxs[1], resolution))\n",
"    X_new = np.c_[x1.ravel(), x2.ravel()]\n",
"    y_pred = clf.predict(X_new).reshape(x1.shape)\n",
"    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])\n",
"    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)\n",
"    custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])\n",
"    plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)\n",
"    if plot_training:\n",
"        plt.plot(X[:, 0][y==0], X[:, 1][y==0], \"yo\", label=\"normal\")\n",
"        plt.plot(X[:, 0][y==1], X[:, 1][y==1], \"bs\", label=\"adware\")\n",
"        plt.plot(X[:, 0][y==2], X[:, 1][y==2], \"g^\", label=\"malware\")\n",
"        plt.legend()\n",
"    plt.axis([mins[0], maxs[0], mins[1], maxs[1]])\n",
"    plt.xlabel(r\"$x_1$\", fontsize=18)\n",
"    plt.ylabel(r\"$x_2$\", fontsize=18, rotation=0)\n",
"\n",
"plt.figure(figsize=(12, 6))\n",
"plot_decision_boundary(clf_tree_reduced, df_reduced.values, y_df)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the goal is not to visualize the dataset but to reduce the dimensionality of the original data, we do not have to pick the number of dimensions arbitrarily: sklearn lets us keep as many dimensions as are needed to preserve a given percentage of the variance."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"# Reduce the dataset while keeping 99.9% of the variance\n",
"from sklearn.decomposition import PCA\n",
"\n",
"pca = PCA(n_components=0.999)\n",
"df_reduced = pca.fit_transform(X_df)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of components: 6\n"
]
}
],
"source": [
"# Number of dimensions in the new dataset\n",
"print(\"Number of components:\", pca.n_components_)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([9.16952089e-01, 5.61087653e-02, 2.16566915e-02, 3.65011318e-03,\n",
" 5.56686331e-04, 3.79356201e-04])"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Proportion of the original dataset's variance preserved by each component\n",
"pca.explained_variance_ratio_"
]
},
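{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of how the 99.9% threshold relates to the ratios above: the cumulative sum of `explained_variance_ratio_` first reaches 0.999 at the last retained component. It assumes the `pca` object fitted in the previous cells; `cumulative` is a name introduced here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Cumulative fraction of variance retained as components are added\n",
"cumulative = np.cumsum(pca.explained_variance_ratio_)\n",
"print(cumulative)\n",
"\n",
"# Smallest number of components whose cumulative variance reaches the threshold\n",
"print(int(np.argmax(cumulative >= 0.999)) + 1)"
]
},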
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>c1</th>\n",
" <th>c2</th>\n",
" <th>c3</th>\n",
" <th>c4</th>\n",
" <th>c5</th>\n",
" <th>c6</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-6.653632e+07</td>\n",
" <td>-9.564604e+06</td>\n",
" <td>3.437284e+06</td>\n",
" <td>-2.949219e+06</td>\n",
" <td>1.822415e+06</td>\n",
" <td>-1.049114e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-6.704580e+07</td>\n",
" <td>-9.898031e+06</td>\n",
" <td>3.424601e+06</td>\n",
" <td>-3.127607e+06</td>\n",
" <td>2.800781e+06</td>\n",
" <td>-1.063954e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-6.712784e+07</td>\n",
" <td>-9.875840e+06</td>\n",
" <td>3.461085e+06</td>\n",
" <td>-3.118886e+06</td>\n",
" <td>2.823975e+06</td>\n",
" <td>-1.022121e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-6.699982e+07</td>\n",
" <td>-9.782837e+06</td>\n",
" <td>3.436564e+06</td>\n",
" <td>-3.051254e+06</td>\n",
" <td>2.557188e+06</td>\n",
" <td>-9.126112e+05</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-6.702599e+07</td>\n",
" <td>-9.829385e+06</td>\n",
" <td>3.484764e+06</td>\n",
" <td>-3.108501e+06</td>\n",
" <td>2.726738e+06</td>\n",
" <td>-1.074407e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631950</th>\n",
" <td>-6.712883e+07</td>\n",
" <td>-9.876545e+06</td>\n",
" <td>3.460169e+06</td>\n",
" <td>-3.121068e+06</td>\n",
" <td>2.841346e+06</td>\n",
" <td>-1.034425e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631951</th>\n",
" <td>-1.490810e+07</td>\n",
" <td>1.783457e+07</td>\n",
" <td>3.127773e+06</td>\n",
" <td>7.128419e+06</td>\n",
" <td>-3.846691e+07</td>\n",
" <td>-8.013578e+06</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631952</th>\n",
" <td>2.098170e+07</td>\n",
" <td>6.844411e+07</td>\n",
" <td>-3.376382e+07</td>\n",
" <td>-2.849601e+07</td>\n",
" <td>2.306222e+06</td>\n",
" <td>-3.567339e+05</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631953</th>\n",
" <td>1.163357e+07</td>\n",
" <td>3.237585e+07</td>\n",
" <td>5.077168e+06</td>\n",
" <td>1.323917e+07</td>\n",
" <td>-5.531286e+07</td>\n",
" <td>-1.294719e+07</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631954</th>\n",
" <td>-1.486282e+07</td>\n",
" <td>1.778576e+07</td>\n",
" <td>3.238090e+06</td>\n",
" <td>7.021148e+06</td>\n",
" <td>-3.809729e+07</td>\n",
" <td>-8.235398e+06</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>631955 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" c1 c2 c3 c4 c5 \\\n",
"0 -6.653632e+07 -9.564604e+06 3.437284e+06 -2.949219e+06 1.822415e+06 \n",
"1 -6.704580e+07 -9.898031e+06 3.424601e+06 -3.127607e+06 2.800781e+06 \n",
"2 -6.712784e+07 -9.875840e+06 3.461085e+06 -3.118886e+06 2.823975e+06 \n",
"3 -6.699982e+07 -9.782837e+06 3.436564e+06 -3.051254e+06 2.557188e+06 \n",
"4 -6.702599e+07 -9.829385e+06 3.484764e+06 -3.108501e+06 2.726738e+06 \n",
"... ... ... ... ... ... \n",
"631950 -6.712883e+07 -9.876545e+06 3.460169e+06 -3.121068e+06 2.841346e+06 \n",
"631951 -1.490810e+07 1.783457e+07 3.127773e+06 7.128419e+06 -3.846691e+07 \n",
"631952 2.098170e+07 6.844411e+07 -3.376382e+07 -2.849601e+07 2.306222e+06 \n",
"631953 1.163357e+07 3.237585e+07 5.077168e+06 1.323917e+07 -5.531286e+07 \n",
"631954 -1.486282e+07 1.778576e+07 3.238090e+06 7.021148e+06 -3.809729e+07 \n",
"\n",
" c6 Class \n",
"0 -1.049114e+06 0 \n",
"1 -1.063954e+06 0 \n",
"2 -1.022121e+06 0 \n",
"3 -9.126112e+05 0 \n",
"4 -1.074407e+06 0 \n",
"... ... ... \n",
"631950 -1.034425e+06 0 \n",
"631951 -8.013578e+06 2 \n",
"631952 -3.567339e+05 1 \n",
"631953 -1.294719e+07 0 \n",
"631954 -8.235398e+06 2 \n",
"\n",
"[631955 rows x 7 columns]"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Convert to a pandas DataFrame; column names follow the number of retained components\n",
"df_reduced = pd.DataFrame(df_reduced, columns=[f\"c{i + 1}\" for i in range(pca.n_components_)])\n",
"df_reduced[\"Class\"] = y_df\n",
"df_reduced"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">6. Splitting the dataset</h2>"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"# Split the dataset into training, validation and test sets\n",
"train_set, val_set, test_set = train_val_test_split(df_reduced)"
]
},
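{
"cell_type": "markdown",
"metadata": {},
"source": [
"`train_val_test_split` is a helper whose definition is not shown in this section. As a hedged sketch only, a 60/20/20 split could be written with scikit-learn's `train_test_split` as below. The function name `split_60_20_20`, the proportions and the optional stratification column are assumptions (the 60% figure is inferred from the 379,173 training rows displayed in the next output), not the author's implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"def split_60_20_20(df, stratify_col=None, random_state=42):\n",
"    # Hypothetical helper: 60% train, 20% validation, 20% test\n",
"    strat = df[stratify_col] if stratify_col else None\n",
"    train, rest = train_test_split(\n",
"        df, test_size=0.4, random_state=random_state, stratify=strat)\n",
"    # Split the remaining 40% in half to obtain validation and test sets\n",
"    strat_rest = rest[stratify_col] if stratify_col else None\n",
"    val, test = train_test_split(\n",
"        rest, test_size=0.5, random_state=random_state, stratify=strat_rest)\n",
"    return train, val, test\n",
"\n",
"# Example call (not executed here):\n",
"# train_set, val_set, test_set = split_60_20_20(df_reduced, stratify_col='Class')"
]
},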
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>c1</th>\n",
" <th>c2</th>\n",
" <th>c3</th>\n",
" <th>c4</th>\n",
" <th>c5</th>\n",
" <th>c6</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>508881</th>\n",
" <td>-6.712885e+07</td>\n",
" <td>-9.876478e+06</td>\n",
" <td>3.460136e+06</td>\n",
" <td>-3.120779e+06</td>\n",
" <td>2.838979e+06</td>\n",
" <td>-1.032784e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>208326</th>\n",
" <td>-6.712918e+07</td>\n",
" <td>-9.875923e+06</td>\n",
" <td>3.459967e+06</td>\n",
" <td>-3.118586e+06</td>\n",
" <td>2.821194e+06</td>\n",
" <td>-1.020321e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>107213</th>\n",
" <td>-6.712918e+07</td>\n",
" <td>-9.875923e+06</td>\n",
" <td>3.459967e+06</td>\n",
" <td>-3.118586e+06</td>\n",
" <td>2.821194e+06</td>\n",
" <td>-1.020321e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>466726</th>\n",
" <td>-6.710449e+07</td>\n",
" <td>-9.882485e+06</td>\n",
" <td>3.449502e+06</td>\n",
" <td>-3.121323e+06</td>\n",
" <td>2.815688e+06</td>\n",
" <td>-1.033614e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>230085</th>\n",
" <td>-6.712918e+07</td>\n",
" <td>-9.875923e+06</td>\n",
" <td>3.459967e+06</td>\n",
" <td>-3.118586e+06</td>\n",
" <td>2.821194e+06</td>\n",
" <td>-1.020321e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>110268</th>\n",
" <td>-4.875034e+07</td>\n",
" <td>-1.097862e+07</td>\n",
" <td>3.460739e+06</td>\n",
" <td>-3.015505e+06</td>\n",
" <td>1.066143e+05</td>\n",
" <td>-2.308220e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>259178</th>\n",
" <td>3.783083e+07</td>\n",
" <td>-2.111494e+07</td>\n",
" <td>1.301973e+04</td>\n",
" <td>-3.084080e+06</td>\n",
" <td>2.770827e+06</td>\n",
" <td>-1.000296e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>365838</th>\n",
" <td>-6.706623e+07</td>\n",
" <td>-9.836596e+06</td>\n",
" <td>3.462837e+06</td>\n",
" <td>-3.084332e+06</td>\n",
" <td>2.701442e+06</td>\n",
" <td>-9.492921e+05</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>131932</th>\n",
" <td>-6.712918e+07</td>\n",
" <td>-9.875909e+06</td>\n",
" <td>3.459959e+06</td>\n",
" <td>-3.118507e+06</td>\n",
" <td>2.820990e+06</td>\n",
" <td>-1.020006e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>121958</th>\n",
" <td>-6.712918e+07</td>\n",
" <td>-9.875923e+06</td>\n",
" <td>3.459967e+06</td>\n",
" <td>-3.118586e+06</td>\n",
" <td>2.821194e+06</td>\n",
" <td>-1.020321e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>379173 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" c1 c2 c3 c4 c5 \\\n",
"508881 -6.712885e+07 -9.876478e+06 3.460136e+06 -3.120779e+06 2.838979e+06 \n",
"208326 -6.712918e+07 -9.875923e+06 3.459967e+06 -3.118586e+06 2.821194e+06 \n",
"107213 -6.712918e+07 -9.875923e+06 3.459967e+06 -3.118586e+06 2.821194e+06 \n",
"466726 -6.710449e+07 -9.882485e+06 3.449502e+06 -3.121323e+06 2.815688e+06 \n",
"230085 -6.712918e+07 -9.875923e+06 3.459967e+06 -3.118586e+06 2.821194e+06 \n",
"... ... ... ... ... ... \n",
"110268 -4.875034e+07 -1.097862e+07 3.460739e+06 -3.015505e+06 1.066143e+05 \n",
"259178 3.783083e+07 -2.111494e+07 1.301973e+04 -3.084080e+06 2.770827e+06 \n",
"365838 -6.706623e+07 -9.836596e+06 3.462837e+06 -3.084332e+06 2.701442e+06 \n",
"131932 -6.712918e+07 -9.875909e+06 3.459959e+06 -3.118507e+06 2.820990e+06 \n",
"121958 -6.712918e+07 -9.875923e+06 3.459967e+06 -3.118586e+06 2.821194e+06 \n",
"\n",
" c6 Class \n",
"508881 -1.032784e+06 0 \n",
"208326 -1.020321e+06 0 \n",
"107213 -1.020321e+06 0 \n",
"466726 -1.033614e+06 0 \n",
"230085 -1.020321e+06 0 \n",
"... ... ... \n",
"110268 -2.308220e+06 0 \n",
"259178 -1.000296e+06 0 \n",
"365838 -9.492921e+05 0 \n",
"131932 -1.020006e+06 0 \n",
"121958 -1.020321e+06 0 \n",
"\n",
"[379173 rows x 7 columns]"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_set"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>c1</th>\n",
" <th>c2</th>\n",
" <th>c3</th>\n",
" <th>c4</th>\n",
" <th>c5</th>\n",
" <th>c6</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>240832</th>\n",
" <td>-3.522534e+07</td>\n",
" <td>-1.329271e+07</td>\n",
" <td>2.411139e+06</td>\n",
" <td>-3.109766e+06</td>\n",
" <td>2.811310e+06</td>\n",
" <td>-1.017766e+06</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>326539</th>\n",
" <td>-6.701093e+07</td>\n",
" <td>-9.907277e+06</td>\n",
" <td>3.409810e+06</td>\n",
" <td>-3.131379e+06</td>\n",
" <td>2.792243e+06</td>\n",
" <td>-1.082202e+06</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200606</th>\n",
" <td>-6.712918e+07</td>\n",
" <td>-9.875923e+06</td>\n",
" <td>3.459967e+06</td>\n",
" <td>-3.118586e+06</td>\n",
" <td>2.821194e+06</td>\n",
" <td>-1.020321e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>431142</th>\n",
" <td>-4.256612e+07</td>\n",
" <td>-2.354263e+06</td>\n",
" <td>1.153506e+07</td>\n",
" <td>-3.312392e+06</td>\n",
" <td>2.921617e+06</td>\n",
" <td>-1.167700e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>478100</th>\n",
" <td>2.235887e+07</td>\n",
" <td>1.898609e+07</td>\n",
" <td>2.733221e+07</td>\n",
" <td>3.696151e+06</td>\n",
" <td>-5.782130e+06</td>\n",
" <td>1.849395e+07</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>215540</th>\n",
" <td>-3.968974e+07</td>\n",
" <td>-1.281458e+07</td>\n",
" <td>2.557897e+06</td>\n",
" <td>-3.110930e+06</td>\n",
" <td>2.812511e+06</td>\n",
" <td>-1.017843e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>516620</th>\n",
" <td>-6.713350e+07</td>\n",
" <td>-9.886325e+06</td>\n",
" <td>3.469475e+06</td>\n",
" <td>-3.171433e+06</td>\n",
" <td>3.265767e+06</td>\n",
" <td>-1.322324e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>592495</th>\n",
" <td>-6.696096e+07</td>\n",
" <td>-9.837763e+06</td>\n",
" <td>3.533368e+06</td>\n",
" <td>-3.174420e+06</td>\n",
" <td>2.691177e+06</td>\n",
" <td>-1.082761e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>279808</th>\n",
" <td>7.514034e+07</td>\n",
" <td>1.065070e+08</td>\n",
" <td>-4.535891e+07</td>\n",
" <td>3.717859e+07</td>\n",
" <td>2.080574e+06</td>\n",
" <td>6.265738e+06</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34456</th>\n",
" <td>-6.712918e+07</td>\n",
" <td>-9.875923e+06</td>\n",
" <td>3.459967e+06</td>\n",
" <td>-3.118586e+06</td>\n",
" <td>2.821194e+06</td>\n",
" <td>-1.020321e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>126391 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" c1 c2 c3 c4 c5 \\\n",
"240832 -3.522534e+07 -1.329271e+07 2.411139e+06 -3.109766e+06 2.811310e+06 \n",
"326539 -6.701093e+07 -9.907277e+06 3.409810e+06 -3.131379e+06 2.792243e+06 \n",
"200606 -6.712918e+07 -9.875923e+06 3.459967e+06 -3.118586e+06 2.821194e+06 \n",
"431142 -4.256612e+07 -2.354263e+06 1.153506e+07 -3.312392e+06 2.921617e+06 \n",
"478100 2.235887e+07 1.898609e+07 2.733221e+07 3.696151e+06 -5.782130e+06 \n",
"... ... ... ... ... ... \n",
"215540 -3.968974e+07 -1.281458e+07 2.557897e+06 -3.110930e+06 2.812511e+06 \n",
"516620 -6.713350e+07 -9.886325e+06 3.469475e+06 -3.171433e+06 3.265767e+06 \n",
"592495 -6.696096e+07 -9.837763e+06 3.533368e+06 -3.174420e+06 2.691177e+06 \n",
"279808 7.514034e+07 1.065070e+08 -4.535891e+07 3.717859e+07 2.080574e+06 \n",
"34456 -6.712918e+07 -9.875923e+06 3.459967e+06 -3.118586e+06 2.821194e+06 \n",
"\n",
" c6 Class \n",
"240832 -1.017766e+06 1 \n",
"326539 -1.082202e+06 1 \n",
"200606 -1.020321e+06 0 \n",
"431142 -1.167700e+06 0 \n",
"478100 1.849395e+07 0 \n",
"... ... ... \n",
"215540 -1.017843e+06 0 \n",
"516620 -1.322324e+06 0 \n",
"592495 -1.082761e+06 0 \n",
"279808 6.265738e+06 1 \n",
"34456 -1.020321e+06 0 \n",
"\n",
"[126391 rows x 7 columns]"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val_set"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>c1</th>\n",
" <th>c2</th>\n",
" <th>c3</th>\n",
" <th>c4</th>\n",
" <th>c5</th>\n",
" <th>c6</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>541262</th>\n",
" <td>-4.331457e+07</td>\n",
" <td>-6.688204e+06</td>\n",
" <td>1.172307e+07</td>\n",
" <td>-3.208084e+06</td>\n",
" <td>-8.829879e+06</td>\n",
" <td>-4.193189e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>335127</th>\n",
" <td>-6.712687e+07</td>\n",
" <td>-9.876602e+06</td>\n",
" <td>3.459028e+06</td>\n",
" <td>-3.119139e+06</td>\n",
" <td>2.823132e+06</td>\n",
" <td>-1.023254e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>217235</th>\n",
" <td>-6.712918e+07</td>\n",
" <td>-9.875923e+06</td>\n",
" <td>3.459967e+06</td>\n",
" <td>-3.118586e+06</td>\n",
" <td>2.821194e+06</td>\n",
" <td>-1.020321e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>230087</th>\n",
" <td>8.906373e+07</td>\n",
" <td>1.181202e+08</td>\n",
" <td>-4.992390e+07</td>\n",
" <td>3.903072e+07</td>\n",
" <td>3.441380e+06</td>\n",
" <td>1.771358e+06</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>493773</th>\n",
" <td>-6.631246e+07</td>\n",
" <td>-9.493603e+06</td>\n",
" <td>3.510767e+06</td>\n",
" <td>-2.933745e+06</td>\n",
" <td>1.639827e+06</td>\n",
" <td>-1.197380e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>234003</th>\n",
" <td>-6.694606e+07</td>\n",
" <td>-9.805741e+06</td>\n",
" <td>3.489556e+06</td>\n",
" <td>-3.071515e+06</td>\n",
" <td>2.537482e+06</td>\n",
" <td>-9.724698e+05</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>521896</th>\n",
" <td>-6.713259e+07</td>\n",
" <td>-9.884166e+06</td>\n",
" <td>3.467482e+06</td>\n",
" <td>-3.160482e+06</td>\n",
" <td>3.171946e+06</td>\n",
" <td>-1.259239e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>608911</th>\n",
" <td>2.166144e+08</td>\n",
" <td>2.168474e+08</td>\n",
" <td>-9.152850e+07</td>\n",
" <td>8.083598e+07</td>\n",
" <td>-7.069317e+06</td>\n",
" <td>3.165290e+07</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>234796</th>\n",
" <td>-5.257914e+07</td>\n",
" <td>-5.220519e+06</td>\n",
" <td>7.859933e+06</td>\n",
" <td>-2.671070e+06</td>\n",
" <td>1.963959e+06</td>\n",
" <td>6.151725e+05</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>225327</th>\n",
" <td>-6.676682e+07</td>\n",
" <td>-9.859577e+06</td>\n",
" <td>3.724604e+06</td>\n",
" <td>-3.179077e+06</td>\n",
" <td>2.615899e+06</td>\n",
" <td>-1.184431e+06</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>126391 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" c1 c2 c3 c4 c5 \\\n",
"541262 -4.331457e+07 -6.688204e+06 1.172307e+07 -3.208084e+06 -8.829879e+06 \n",
"335127 -6.712687e+07 -9.876602e+06 3.459028e+06 -3.119139e+06 2.823132e+06 \n",
"217235 -6.712918e+07 -9.875923e+06 3.459967e+06 -3.118586e+06 2.821194e+06 \n",
"230087 8.906373e+07 1.181202e+08 -4.992390e+07 3.903072e+07 3.441380e+06 \n",
"493773 -6.631246e+07 -9.493603e+06 3.510767e+06 -2.933745e+06 1.639827e+06 \n",
"... ... ... ... ... ... \n",
"234003 -6.694606e+07 -9.805741e+06 3.489556e+06 -3.071515e+06 2.537482e+06 \n",
"521896 -6.713259e+07 -9.884166e+06 3.467482e+06 -3.160482e+06 3.171946e+06 \n",
"608911 2.166144e+08 2.168474e+08 -9.152850e+07 8.083598e+07 -7.069317e+06 \n",
"234796 -5.257914e+07 -5.220519e+06 7.859933e+06 -2.671070e+06 1.963959e+06 \n",
"225327 -6.676682e+07 -9.859577e+06 3.724604e+06 -3.179077e+06 2.615899e+06 \n",
"\n",
" c6 Class \n",
"541262 -4.193189e+06 0 \n",
"335127 -1.023254e+06 0 \n",
"217235 -1.020321e+06 0 \n",
"230087 1.771358e+06 1 \n",
"493773 -1.197380e+06 0 \n",
"... ... ... \n",
"234003 -9.724698e+05 1 \n",
"521896 -1.259239e+06 0 \n",
"608911 3.165290e+07 0 \n",
"234796 6.151725e+05 1 \n",
"225327 -1.184431e+06 0 \n",
"\n",
"[126391 rows x 7 columns]"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_set"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"# Separate the input features from the output label\n",
"X_train, y_train = remove_labels(train_set, 'Class')\n",
"X_val, y_val = remove_labels(val_set, 'Class')\n",
"X_test, y_test = remove_labels(test_set, 'Class')"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set length: 379173\n",
"Validation set length: 126391\n",
"Test set length: 126391\n"
]
}
],
"source": [
"print(\"Training set length:\", len(train_set))\n",
"print(\"Validation set length:\", len(val_set))\n",
"print(\"Test set length:\", len(test_set))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">7. Random Forests</h2>"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n",
" criterion='gini', max_depth=30, max_features='auto',\n",
" max_leaf_nodes=None, max_samples=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=200,\n",
" n_jobs=-1, oob_score=False, random_state=42, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# max_depth=30 -> each tree can grow to a depth of at most 30 levels\n",
"clf_rnd = RandomForestClassifier(n_estimators=200, max_depth=30, random_state=42, n_jobs=-1)\n",
"clf_rnd.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 0, 0, ..., 0, 1, 0], dtype=int64)"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Predict on the validation set\n",
"y_val_pred = clf_rnd.predict(X_val)\n",
"y_val_pred"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 score validation set: 0.8914140064148489\n"
]
}
],
"source": [
"# F1 score on the validation set; f1_score expects (y_true, y_pred)\n",
"print(\"F1 score validation set:\", f1_score(y_val, y_val_pred, average='weighted'))"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"# Predict on the test set\n",
"y_test_pred = clf_rnd.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 score test set: 0.8945365648002198\n"
]
}
],
"source": [
"# F1 score on the test set; f1_score expects (y_true, y_pred)\n",
"print(\"F1 score test set:\", f1_score(y_test, y_test_pred, average='weighted'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}