{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model: _Random Forest for feature selection_\n",
"This practical use case presents a feature selection mechanism based on Random Forest."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Author: José Alamo Palomino"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Case Study: Android malware detection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sophisticated, advanced Android malware can detect the emulator used by the malware analyst and, in response, alter its behavior to evade detection. To overcome this problem, we installed the Android applications on real devices and captured their network traffic. See our publicly available Android Sandbox.\n",
"\n",
"The CICAAGM dataset is captured by installing the Android applications on real smartphones in a semi-automated way. The dataset is generated from 1900 applications in the following three categories:\n",
"\n",
"### 1. Adware (250 applications)\n",
"\n",
"* **Airpush:** designed to deliver unsolicited advertisements to the user's system in order to steal information.\n",
"\n",
"* **Dowgin:** designed as an advertising library that can also steal the user's information.\n",
"\n",
"* **Kemoge:** designed to take over a user's Android device. This adware is a botnet hybrid that disguises itself as popular applications through repackaging.\n",
"\n",
"* **Mobidash:** designed to display ads and compromise the user's personal information.\n",
"\n",
"* **Shuanet:** similar to Kemoge, Shuanet is also designed to take over a user's device.\n",
"\n",
"### 2. General malware (150 applications)\n",
"\n",
"* **AVpass:** designed to be distributed under the guise of a clock application.\n",
"\n",
"* **FakeAV:** designed as a scam that tricks the user into buying a full version of the software in order to remediate nonexistent infections.\n",
"\n",
"* **FakeFlash / FakePlayer:** designed as a fake Flash application that redirects users to a website (after a successful installation).\n",
"\n",
"* **GGtracker:** designed for SMS fraud (it sends SMS messages to a premium-rate number) and information theft.\n",
"\n",
"* **Penetho:** designed as a fake service (a hacktool for Android devices that can be used to crack WiFi passwords). The malware can also infect the user's computer through infected email attachments, fake updates, external media, and infected documents.\n",
"\n",
"### 3. Benign (1500 applications)\n",
"\n",
"* 2015 Google Play market (top free popular and top free new)\n",
"* 2016 Google Play market (top free popular and top free new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Observation:\n",
"The dataset was built by installing applications on Android devices and then capturing the network traffic those devices generated.\n",
"\n",
"The goal is to train a Random Forest algorithm that can distinguish network traffic flows belonging to the classes `benign`, `asware` (the label this dataset uses for adware) and `GeneralMalware`, so that when new network traffic arrives, the model returns one of these three categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">1. Importing the required libraries</h2>"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import RobustScaler\n",
"from sklearn.metrics import f1_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">2. Helper functions</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 Function to split the dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):\n",
"    \"\"\"Split df into 60% train, 20% validation and 20% test sets.\"\"\"\n",
"    strat = df[stratify] if stratify else None\n",
"    # First split: 60% train, 40% held out.\n",
"    train_set, test_set = train_test_split(\n",
"        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)\n",
"    # Second split: halve the held-out 40% into validation and test.\n",
"    strat = test_set[stratify] if stratify else None\n",
"    val_set, test_set = train_test_split(\n",
"        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)\n",
"    return (train_set, val_set, test_set)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Function to separate the input features from the output label"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def remove_labels(df, label_name):\n",
"    \"\"\"Return (X, y): the feature columns and a copy of the label column.\"\"\"\n",
"    X = df.drop(label_name, axis=1)\n",
"    y = df[label_name].copy()\n",
"    return (X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">3. Reading the dataset</h2>"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration</th>\n",
" <th>total_fpackets</th>\n",
" <th>total_bpackets</th>\n",
" <th>total_fpktl</th>\n",
" <th>total_bpktl</th>\n",
" <th>min_fpktl</th>\n",
" <th>min_bpktl</th>\n",
" <th>max_fpktl</th>\n",
" <th>max_bpktl</th>\n",
" <th>mean_fpktl</th>\n",
" <th>...</th>\n",
" <th>mean_idle</th>\n",
" <th>max_idle</th>\n",
" <th>std_idle</th>\n",
" <th>FFNEPD</th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>RRT_samples_clnt</th>\n",
" <th>Act_data_pkt_forward</th>\n",
" <th>min_seg_size_forward</th>\n",
" <th>calss</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1020586</td>\n",
" <td>668</td>\n",
" <td>1641</td>\n",
" <td>35692</td>\n",
" <td>2276876</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>679</td>\n",
" <td>1390</td>\n",
" <td>53.431138</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>1853440</td>\n",
" <td>1640</td>\n",
" <td>668</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80794</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>998</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>187</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>83</td>\n",
" <td>-1</td>\n",
" <td>62.333333</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>4</td>\n",
" <td>101888</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>189868</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>1448</td>\n",
" <td>6200</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>706</td>\n",
" <td>1390</td>\n",
" <td>160.888889</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>2722560</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>110577</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>528</td>\n",
" <td>1422</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>331</td>\n",
" <td>1005</td>\n",
" <td>132.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>155136</td>\n",
" <td>31232</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631950</th>\n",
" <td>530</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74</td>\n",
" <td>334</td>\n",
" <td>74.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631951</th>\n",
" <td>50240627</td>\n",
" <td>23</td>\n",
" <td>24</td>\n",
" <td>4767</td>\n",
" <td>6107</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>533</td>\n",
" <td>855</td>\n",
" <td>207.260870</td>\n",
" <td>...</td>\n",
" <td>9842879.0</td>\n",
" <td>9964749</td>\n",
" <td>1.196806e+05</td>\n",
" <td>2</td>\n",
" <td>317952</td>\n",
" <td>107008</td>\n",
" <td>11</td>\n",
" <td>23</td>\n",
" <td>32</td>\n",
" <td>GeneralMalware</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631952</th>\n",
" <td>35471450</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>52</td>\n",
" <td>104</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>35300000.0</td>\n",
" <td>35290631</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2</td>\n",
" <td>3904</td>\n",
" <td>88704</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>32</td>\n",
" <td>asware</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631953</th>\n",
" <td>41713629</td>\n",
" <td>12</td>\n",
" <td>26</td>\n",
" <td>1821</td>\n",
" <td>18643</td>\n",
" <td>40</td>\n",
" <td>40</td>\n",
" <td>489</td>\n",
" <td>1390</td>\n",
" <td>151.750000</td>\n",
" <td>...</td>\n",
" <td>20200000.0</td>\n",
" <td>32711382</td>\n",
" <td>1.770000e+07</td>\n",
" <td>2</td>\n",
" <td>227456</td>\n",
" <td>2432</td>\n",
" <td>23</td>\n",
" <td>12</td>\n",
" <td>20</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>631954</th>\n",
" <td>50110119</td>\n",
" <td>20</td>\n",
" <td>23</td>\n",
" <td>4130</td>\n",
" <td>6043</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>533</td>\n",
" <td>855</td>\n",
" <td>206.500000</td>\n",
" <td>...</td>\n",
" <td>9873329.4</td>\n",
" <td>9906007</td>\n",
" <td>4.737363e+04</td>\n",
" <td>2</td>\n",
" <td>266112</td>\n",
" <td>59904</td>\n",
" <td>11</td>\n",
" <td>20</td>\n",
" <td>32</td>\n",
" <td>GeneralMalware</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>631955 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" duration total_fpackets total_bpackets total_fpktl total_bpktl \\\n",
"0 1020586 668 1641 35692 2276876 \n",
"1 80794 1 1 75 124 \n",
"2 998 3 0 187 0 \n",
"3 189868 9 9 1448 6200 \n",
"4 110577 4 6 528 1422 \n",
"... ... ... ... ... ... \n",
"631950 530 1 1 74 334 \n",
"631951 50240627 23 24 4767 6107 \n",
"631952 35471450 1 2 52 104 \n",
"631953 41713629 12 26 1821 18643 \n",
"631954 50110119 20 23 4130 6043 \n",
"\n",
" min_fpktl min_bpktl max_fpktl max_bpktl mean_fpktl ... \\\n",
"0 52 52 679 1390 53.431138 ... \n",
"1 75 124 75 124 75.000000 ... \n",
"2 52 -1 83 -1 62.333333 ... \n",
"3 52 52 706 1390 160.888889 ... \n",
"4 52 52 331 1005 132.000000 ... \n",
"... ... ... ... ... ... ... \n",
"631950 74 334 74 334 74.000000 ... \n",
"631951 52 52 533 855 207.260870 ... \n",
"631952 52 52 52 52 52.000000 ... \n",
"631953 40 40 489 1390 151.750000 ... \n",
"631954 52 52 533 855 206.500000 ... \n",
"\n",
" mean_idle max_idle std_idle FFNEPD Init_Win_bytes_forward \\\n",
"0 0.0 -1 0.000000e+00 2 4194240 \n",
"1 0.0 -1 0.000000e+00 2 0 \n",
"2 0.0 -1 0.000000e+00 4 101888 \n",
"3 0.0 -1 0.000000e+00 2 4194240 \n",
"4 0.0 -1 0.000000e+00 2 155136 \n",
"... ... ... ... ... ... \n",
"631950 0.0 -1 0.000000e+00 2 0 \n",
"631951 9842879.0 9964749 1.196806e+05 2 317952 \n",
"631952 35300000.0 35290631 0.000000e+00 2 3904 \n",
"631953 20200000.0 32711382 1.770000e+07 2 227456 \n",
"631954 9873329.4 9906007 4.737363e+04 2 266112 \n",
"\n",
" Init_Win_bytes_backward RRT_samples_clnt Act_data_pkt_forward \\\n",
"0 1853440 1640 668 \n",
"1 0 0 1 \n",
"2 -1 0 3 \n",
"3 2722560 8 9 \n",
"4 31232 5 4 \n",
"... ... ... ... \n",
"631950 0 0 1 \n",
"631951 107008 11 23 \n",
"631952 88704 1 1 \n",
"631953 2432 23 12 \n",
"631954 59904 11 20 \n",
"\n",
" min_seg_size_forward calss \n",
"0 32 benign \n",
"1 0 benign \n",
"2 32 benign \n",
"3 32 benign \n",
"4 32 benign \n",
"... ... ... \n",
"631950 0 benign \n",
"631951 32 GeneralMalware \n",
"631952 32 asware \n",
"631953 20 benign \n",
"631954 32 GeneralMalware \n",
"\n",
"[631955 rows x 80 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('Datasets/TotalFeatures-ISCXFlowMeter.csv')\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">4. Inspecting the dataset</h2>"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration</th>\n",
" <th>total_fpackets</th>\n",
" <th>total_bpackets</th>\n",
" <th>total_fpktl</th>\n",
" <th>total_bpktl</th>\n",
" <th>min_fpktl</th>\n",
" <th>min_bpktl</th>\n",
" <th>max_fpktl</th>\n",
" <th>max_bpktl</th>\n",
" <th>mean_fpktl</th>\n",
" <th>...</th>\n",
" <th>mean_idle</th>\n",
" <th>max_idle</th>\n",
" <th>std_idle</th>\n",
" <th>FFNEPD</th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>RRT_samples_clnt</th>\n",
" <th>Act_data_pkt_forward</th>\n",
" <th>min_seg_size_forward</th>\n",
" <th>calss</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1020586</td>\n",
" <td>668</td>\n",
" <td>1641</td>\n",
" <td>35692</td>\n",
" <td>2276876</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>679</td>\n",
" <td>1390</td>\n",
" <td>53.431138</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>1853440</td>\n",
" <td>1640</td>\n",
" <td>668</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80794</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75</td>\n",
" <td>124</td>\n",
" <td>75.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>998</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>187</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>83</td>\n",
" <td>-1</td>\n",
" <td>62.333333</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>101888</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>189868</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>1448</td>\n",
" <td>6200</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>706</td>\n",
" <td>1390</td>\n",
" <td>160.888889</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>2722560</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>110577</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>528</td>\n",
" <td>1422</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>331</td>\n",
" <td>1005</td>\n",
" <td>132.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>155136</td>\n",
" <td>31232</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>261876</td>\n",
" <td>7</td>\n",
" <td>6</td>\n",
" <td>1618</td>\n",
" <td>882</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>730</td>\n",
" <td>477</td>\n",
" <td>231.142857</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>4194240</td>\n",
" <td>926720</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>14</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>104</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>5824</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>29675</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>71</td>\n",
" <td>213</td>\n",
" <td>71</td>\n",
" <td>213</td>\n",
" <td>71</td>\n",
" <td>213</td>\n",
" <td>71.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>806635</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>239</td>\n",
" <td>0</td>\n",
" <td>52</td>\n",
" <td>-1</td>\n",
" <td>83</td>\n",
" <td>-1</td>\n",
" <td>59.750000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>107008</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>56620</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1074</td>\n",
" <td>719</td>\n",
" <td>52</td>\n",
" <td>52</td>\n",
" <td>592</td>\n",
" <td>667</td>\n",
" <td>358.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>-1</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>128512</td>\n",
" <td>10816</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>32</td>\n",
" <td>benign</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" duration total_fpackets total_bpackets total_fpktl total_bpktl \\\n",
"0 1020586 668 1641 35692 2276876 \n",
"1 80794 1 1 75 124 \n",
"2 998 3 0 187 0 \n",
"3 189868 9 9 1448 6200 \n",
"4 110577 4 6 528 1422 \n",
"5 261876 7 6 1618 882 \n",
"6 14 2 0 104 0 \n",
"7 29675 1 1 71 213 \n",
"8 806635 4 0 239 0 \n",
"9 56620 3 2 1074 719 \n",
"\n",
" min_fpktl min_bpktl max_fpktl max_bpktl mean_fpktl ... mean_idle \\\n",
"0 52 52 679 1390 53.431138 ... 0.0 \n",
"1 75 124 75 124 75.000000 ... 0.0 \n",
"2 52 -1 83 -1 62.333333 ... 0.0 \n",
"3 52 52 706 1390 160.888889 ... 0.0 \n",
"4 52 52 331 1005 132.000000 ... 0.0 \n",
"5 52 52 730 477 231.142857 ... 0.0 \n",
"6 52 -1 52 -1 52.000000 ... 0.0 \n",
"7 71 213 71 213 71.000000 ... 0.0 \n",
"8 52 -1 83 -1 59.750000 ... 0.0 \n",
"9 52 52 592 667 358.000000 ... 0.0 \n",
"\n",
" max_idle std_idle FFNEPD Init_Win_bytes_forward \\\n",
"0 -1 0.0 2 4194240 \n",
"1 -1 0.0 2 0 \n",
"2 -1 0.0 4 101888 \n",
"3 -1 0.0 2 4194240 \n",
"4 -1 0.0 2 155136 \n",
"5 -1 0.0 2 4194240 \n",
"6 -1 0.0 3 5824 \n",
"7 -1 0.0 2 0 \n",
"8 -1 0.0 5 107008 \n",
"9 -1 0.0 3 128512 \n",
"\n",
" Init_Win_bytes_backward RRT_samples_clnt Act_data_pkt_forward \\\n",
"0 1853440 1640 668 \n",
"1 0 0 1 \n",
"2 -1 0 3 \n",
"3 2722560 8 9 \n",
"4 31232 5 4 \n",
"5 926720 3 7 \n",
"6 -1 0 2 \n",
"7 0 0 1 \n",
"8 -1 0 4 \n",
"9 10816 1 3 \n",
"\n",
" min_seg_size_forward calss \n",
"0 32 benign \n",
"1 0 benign \n",
"2 32 benign \n",
"3 32 benign \n",
"4 32 benign \n",
"5 32 benign \n",
"6 32 benign \n",
"7 0 benign \n",
"8 32 benign \n",
"9 32 benign \n",
"\n",
"[10 rows x 80 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration</th>\n",
" <th>total_fpackets</th>\n",
" <th>total_bpackets</th>\n",
" <th>total_fpktl</th>\n",
" <th>total_bpktl</th>\n",
" <th>min_fpktl</th>\n",
" <th>min_bpktl</th>\n",
" <th>max_fpktl</th>\n",
" <th>max_bpktl</th>\n",
" <th>mean_fpktl</th>\n",
" <th>...</th>\n",
" <th>min_idle</th>\n",
" <th>mean_idle</th>\n",
" <th>max_idle</th>\n",
" <th>std_idle</th>\n",
" <th>FFNEPD</th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>RRT_samples_clnt</th>\n",
" <th>Act_data_pkt_forward</th>\n",
" <th>min_seg_size_forward</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.000000</td>\n",
" <td>...</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>6.319550e+05</td>\n",
" <td>6.319550e+05</td>\n",
" <td>631955.000000</td>\n",
" <td>631955.00000</td>\n",
" <td>631955.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>2.195245e+07</td>\n",
" <td>6.728514</td>\n",
" <td>10.431934</td>\n",
" <td>9.540172e+02</td>\n",
" <td>1.206042e+04</td>\n",
" <td>141.475727</td>\n",
" <td>44.357688</td>\n",
" <td>263.675901</td>\n",
" <td>183.248084</td>\n",
" <td>174.959706</td>\n",
" <td>...</td>\n",
" <td>1.997327e+07</td>\n",
" <td>2.031228e+07</td>\n",
" <td>2.075238e+07</td>\n",
" <td>4.663875e+05</td>\n",
" <td>2.360896</td>\n",
" <td>9.620796e+05</td>\n",
" <td>3.104519e+05</td>\n",
" <td>9.733144</td>\n",
" <td>6.72471</td>\n",
" <td>19.965713</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>1.900578e+08</td>\n",
" <td>174.161354</td>\n",
" <td>349.424019</td>\n",
" <td>8.235040e+04</td>\n",
" <td>4.824716e+05</td>\n",
" <td>157.680880</td>\n",
" <td>89.099554</td>\n",
" <td>289.644383</td>\n",
" <td>371.863224</td>\n",
" <td>162.024811</td>\n",
" <td>...</td>\n",
" <td>1.897986e+08</td>\n",
" <td>1.897902e+08</td>\n",
" <td>1.899721e+08</td>\n",
" <td>6.199704e+06</td>\n",
" <td>3.041810</td>\n",
" <td>1.705655e+06</td>\n",
" <td>6.647956e+05</td>\n",
" <td>347.877923</td>\n",
" <td>174.13813</td>\n",
" <td>14.914261</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-1.800000e+01</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.000000e+00</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>6.900000e+01</td>\n",
" <td>0.000000e+00</td>\n",
" <td>52.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>52.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>1.00000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>2.445000e+04</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.840000e+02</td>\n",
" <td>0.000000e+00</td>\n",
" <td>52.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>83.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>83.000000</td>\n",
" <td>...</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>8.761600e+04</td>\n",
" <td>-1.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>1.00000</td>\n",
" <td>32.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1.759751e+06</td>\n",
" <td>3.000000</td>\n",
" <td>1.000000</td>\n",
" <td>4.270000e+02</td>\n",
" <td>1.670000e+02</td>\n",
" <td>108.000000</td>\n",
" <td>52.000000</td>\n",
" <td>421.000000</td>\n",
" <td>115.000000</td>\n",
" <td>356.000000</td>\n",
" <td>...</td>\n",
" <td>1.013498e+06</td>\n",
" <td>1.291379e+06</td>\n",
" <td>1.306116e+06</td>\n",
" <td>0.000000e+00</td>\n",
" <td>2.000000</td>\n",
" <td>3.046400e+05</td>\n",
" <td>9.049600e+04</td>\n",
" <td>1.000000</td>\n",
" <td>3.00000</td>\n",
" <td>32.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>4.431076e+10</td>\n",
" <td>48255.000000</td>\n",
" <td>74768.000000</td>\n",
" <td>4.049644e+07</td>\n",
" <td>1.039222e+08</td>\n",
" <td>1390.000000</td>\n",
" <td>1390.000000</td>\n",
" <td>1500.000000</td>\n",
" <td>1390.000000</td>\n",
" <td>1390.000000</td>\n",
" <td>...</td>\n",
" <td>4.431072e+10</td>\n",
" <td>4.430000e+10</td>\n",
" <td>4.431072e+10</td>\n",
" <td>8.470000e+08</td>\n",
" <td>2269.000000</td>\n",
" <td>4.194240e+06</td>\n",
" <td>4.194240e+06</td>\n",
" <td>74524.000000</td>\n",
" <td>48255.00000</td>\n",
" <td>44.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 79 columns</p>\n",
"</div>"
],
"text/plain": [
" duration total_fpackets total_bpackets total_fpktl \\\n",
"count 6.319550e+05 631955.000000 631955.000000 6.319550e+05 \n",
"mean 2.195245e+07 6.728514 10.431934 9.540172e+02 \n",
"std 1.900578e+08 174.161354 349.424019 8.235040e+04 \n",
"min -1.800000e+01 0.000000 0.000000 0.000000e+00 \n",
"25% 0.000000e+00 1.000000 0.000000 6.900000e+01 \n",
"50% 2.445000e+04 1.000000 0.000000 1.840000e+02 \n",
"75% 1.759751e+06 3.000000 1.000000 4.270000e+02 \n",
"max 4.431076e+10 48255.000000 74768.000000 4.049644e+07 \n",
"\n",
" total_bpktl min_fpktl min_bpktl max_fpktl \\\n",
"count 6.319550e+05 631955.000000 631955.000000 631955.000000 \n",
"mean 1.206042e+04 141.475727 44.357688 263.675901 \n",
"std 4.824716e+05 157.680880 89.099554 289.644383 \n",
"min 0.000000e+00 -1.000000 -1.000000 -1.000000 \n",
"25% 0.000000e+00 52.000000 -1.000000 52.000000 \n",
"50% 0.000000e+00 52.000000 -1.000000 83.000000 \n",
"75% 1.670000e+02 108.000000 52.000000 421.000000 \n",
"max 1.039222e+08 1390.000000 1390.000000 1500.000000 \n",
"\n",
" max_bpktl mean_fpktl ... min_idle mean_idle \\\n",
"count 631955.000000 631955.000000 ... 6.319550e+05 6.319550e+05 \n",
"mean 183.248084 174.959706 ... 1.997327e+07 2.031228e+07 \n",
"std 371.863224 162.024811 ... 1.897986e+08 1.897902e+08 \n",
"min -1.000000 0.000000 ... -1.000000e+00 0.000000e+00 \n",
"25% -1.000000 52.000000 ... -1.000000e+00 0.000000e+00 \n",
"50% -1.000000 83.000000 ... -1.000000e+00 0.000000e+00 \n",
"75% 115.000000 356.000000 ... 1.013498e+06 1.291379e+06 \n",
"max 1390.000000 1390.000000 ... 4.431072e+10 4.430000e+10 \n",
"\n",
" max_idle std_idle FFNEPD Init_Win_bytes_forward \\\n",
"count 6.319550e+05 6.319550e+05 631955.000000 6.319550e+05 \n",
"mean 2.075238e+07 4.663875e+05 2.360896 9.620796e+05 \n",
"std 1.899721e+08 6.199704e+06 3.041810 1.705655e+06 \n",
"min -1.000000e+00 0.000000e+00 2.000000 -1.000000e+00 \n",
"25% -1.000000e+00 0.000000e+00 2.000000 0.000000e+00 \n",
"50% -1.000000e+00 0.000000e+00 2.000000 8.761600e+04 \n",
"75% 1.306116e+06 0.000000e+00 2.000000 3.046400e+05 \n",
"max 4.431072e+10 8.470000e+08 2269.000000 4.194240e+06 \n",
"\n",
" Init_Win_bytes_backward RRT_samples_clnt Act_data_pkt_forward \\\n",
"count 6.319550e+05 631955.000000 631955.00000 \n",
"mean 3.104519e+05 9.733144 6.72471 \n",
"std 6.647956e+05 347.877923 174.13813 \n",
"min -1.000000e+00 0.000000 0.00000 \n",
"25% -1.000000e+00 0.000000 1.00000 \n",
"50% -1.000000e+00 0.000000 1.00000 \n",
"75% 9.049600e+04 1.000000 3.00000 \n",
"max 4.194240e+06 74524.000000 48255.00000 \n",
"\n",
" min_seg_size_forward \n",
"count 631955.000000 \n",
"mean 19.965713 \n",
"std 14.914261 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 32.000000 \n",
"75% 32.000000 \n",
"max 44.000000 \n",
"\n",
"[8 rows x 79 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 631955 entries, 0 to 631954\n",
"Data columns (total 80 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 duration 631955 non-null int64 \n",
" 1 total_fpackets 631955 non-null int64 \n",
" 2 total_bpackets 631955 non-null int64 \n",
" 3 total_fpktl 631955 non-null int64 \n",
" 4 total_bpktl 631955 non-null int64 \n",
" 5 min_fpktl 631955 non-null int64 \n",
" 6 min_bpktl 631955 non-null int64 \n",
" 7 max_fpktl 631955 non-null int64 \n",
" 8 max_bpktl 631955 non-null int64 \n",
" 9 mean_fpktl 631955 non-null float64\n",
" 10 mean_bpktl 631955 non-null float64\n",
" 11 std_fpktl 631955 non-null float64\n",
" 12 std_bpktl 631955 non-null float64\n",
" 13 total_fiat 631955 non-null int64 \n",
" 14 total_biat 631955 non-null int64 \n",
" 15 min_fiat 631955 non-null int64 \n",
" 16 min_biat 631955 non-null int64 \n",
" 17 max_fiat 631955 non-null int64 \n",
" 18 max_biat 631955 non-null int64 \n",
" 19 mean_fiat 631955 non-null float64\n",
" 20 mean_biat 631955 non-null float64\n",
" 21 std_fiat 631955 non-null float64\n",
" 22 std_biat 631955 non-null float64\n",
" 23 fpsh_cnt 631955 non-null int64 \n",
" 24 bpsh_cnt 631955 non-null int64 \n",
" 25 furg_cnt 631955 non-null int64 \n",
" 26 burg_cnt 631955 non-null int64 \n",
" 27 total_fhlen 631955 non-null int64 \n",
" 28 total_bhlen 631955 non-null int64 \n",
" 29 fPktsPerSecond 631955 non-null float64\n",
" 30 bPktsPerSecond 631955 non-null float64\n",
" 31 flowPktsPerSecond 631955 non-null float64\n",
" 32 flowBytesPerSecond 631955 non-null float64\n",
" 33 min_flowpktl 631955 non-null int64 \n",
" 34 max_flowpktl 631955 non-null int64 \n",
" 35 mean_flowpktl 631955 non-null float64\n",
" 36 std_flowpktl 631955 non-null float64\n",
" 37 min_flowiat 631955 non-null int64 \n",
" 38 max_flowiat 631955 non-null int64 \n",
" 39 mean_flowiat 631955 non-null float64\n",
" 40 std_flowiat 631955 non-null float64\n",
" 41 flow_fin 631955 non-null int64 \n",
" 42 flow_syn 631955 non-null int64 \n",
" 43 flow_rst 631955 non-null int64 \n",
" 44 flow_psh 631955 non-null int64 \n",
" 45 flow_ack 631955 non-null int64 \n",
" 46 flow_urg 631955 non-null int64 \n",
" 47 flow_cwr 631955 non-null int64 \n",
" 48 flow_ece 631955 non-null int64 \n",
" 49 downUpRatio 631955 non-null float64\n",
" 50 avgPacketSize 631955 non-null float64\n",
" 51 fAvgSegmentSize 631955 non-null float64\n",
" 52 fHeaderBytes 631955 non-null int64 \n",
" 53 fAvgBytesPerBulk 631955 non-null int64 \n",
" 54 fAvgPacketsPerBulk 631955 non-null int64 \n",
" 55 fAvgBulkRate 631955 non-null int64 \n",
" 56 bVarianceDataBytes 631955 non-null float64\n",
" 57 bAvgSegmentSize 631955 non-null int64 \n",
" 58 bAvgBytesPerBulk 631955 non-null int64 \n",
" 59 bAvgPacketsPerBulk 631955 non-null int64 \n",
" 60 bAvgBulkRate 631955 non-null int64 \n",
" 61 sflow_fpacket 631955 non-null int64 \n",
" 62 sflow_fbytes 631955 non-null int64 \n",
" 63 sflow_bpacket 631955 non-null int64 \n",
" 64 sflow_bbytes 631955 non-null int64 \n",
" 65 min_active 631955 non-null int64 \n",
" 66 mean_active 631955 non-null float64\n",
" 67 max_active 631955 non-null int64 \n",
" 68 std_active 631955 non-null float64\n",
" 69 min_idle 631955 non-null int64 \n",
" 70 mean_idle 631955 non-null float64\n",
" 71 max_idle 631955 non-null int64 \n",
" 72 std_idle 631955 non-null float64\n",
" 73 FFNEPD 631955 non-null int64 \n",
" 74 Init_Win_bytes_forward 631955 non-null int64 \n",
" 75 Init_Win_bytes_backward 631955 non-null int64 \n",
" 76 RRT_samples_clnt 631955 non-null int64 \n",
" 77 Act_data_pkt_forward 631955 non-null int64 \n",
" 78 min_seg_size_forward 631955 non-null int64 \n",
" 79 calss 631955 non-null object \n",
"dtypes: float64(24), int64(55), object(1)\n",
"memory usage: 385.7+ MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows in the dataset: 631955\n"
]
}
],
"source": [
"print(\"Number of rows in the dataset:\", len(df))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of columns in the dataset (including the label column): 80\n"
]
}
],
"source": [
"print(\"Number of columns in the dataset (including the label column):\", len(df.columns))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"benign 471597\n",
"asware 155613\n",
"GeneralMalware 4745\n",
"Name: calss, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Class distribution of the label column ('calss' is the column name as spelled in the dataset)\n",
"df[\"calss\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">5. Splitting the dataset</h2>"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"train_set, val_set, test_set = train_val_test_split(df)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"X_train, y_train = remove_labels(train_set, 'calss')\n",
"X_val, y_val = remove_labels(val_set, 'calss')\n",
"X_test, y_test = remove_labels(test_set, 'calss')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 379173\n",
"Validation set size: 126391\n",
"Test set size: 126391\n"
]
}
],
"source": [
"print(\"Training set size:\", len(train_set))\n",
"print(\"Validation set size:\", len(val_set))\n",
"print(\"Test set size:\", len(test_set))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">6. Random Forests</h2>"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n",
" criterion='gini', max_depth=None, max_features='auto',\n",
" max_leaf_nodes=None, max_samples=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,\n",
" oob_score=False, random_state=42, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Goal: reduce the number of input features (feature selection) to shorten training time\n",
"# and, ideally, maintain or improve the model's classification performance.\n",
"\n",
"# Feature selection with Random Forest starts by training a Random Forest estimator\n",
"# on all the available features.\n",
"\n",
"# Import the RandomForestClassifier estimator from sklearn.ensemble.\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Instantiate RandomForestClassifier as clf_rnd with the following parameters:\n",
"\n",
"# n_estimators=50 -> train 50 randomized decision trees\n",
"# random_state=42 -> fix the random seed so that results are reproducible\n",
"# n_jobs=-1 -> use all available CPU cores to train the trees in parallel\n",
"clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)\n",
"\n",
"# Fit the estimator on the training subset.\n",
"clf_rnd.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['asware', 'asware', 'benign', ..., 'benign', 'asware', 'benign'],\n",
" dtype=object)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Predict on the validation set\n",
"y_pred = clf_rnd.predict(X_val)\n",
"y_pred"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 score: 0.9324043007314987\n"
]
}
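],
"source": [
"# Weighted F1 score on the validation set, about 0.932. This is an F1 score, not the share of correct predictions (accuracy).\n",
"# Note: scikit-learn documents the signature as f1_score(y_true, y_pred); this call passes y_pred first.\n",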
"print(\"F1 score:\", f1_score(y_pred, y_val, average='weighted'))"
]
},
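{
"cell_type": "markdown",
"metadata": {},
"source": [
"A weighted F1 score averages the per-class F1 scores, using each class's number of true samples as its weight, so the order of the two arguments matters; the call above passes `y_pred` first. The cell below is a minimal pure-Python sketch of that computation on made-up toy labels. The helper `weighted_f1` and both label lists are illustrative assumptions, not part of the original notebook or of scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def weighted_f1(y_true, y_pred):\n",
"    # Hypothetical helper: per-class F1, weighted by each class's count in y_true.\n",
"    support = Counter(y_true)\n",
"    pairs = list(zip(y_true, y_pred))\n",
"    total = 0.0\n",
"    for label in set(y_true) | set(y_pred):\n",
"        tp = sum(1 for t, p in pairs if t == label and p == label)\n",
"        fp = sum(1 for t, p in pairs if t != label and p == label)\n",
"        fn = sum(1 for t, p in pairs if t == label and p != label)\n",
"        denom = 2 * tp + fp + fn\n",
"        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)\n",
"        total += support[label] * f1\n",
"    return total / len(y_true)\n",
"\n",
"# Toy labels (assumption): three classes, six samples.\n",
"y_true = ['benign', 'benign', 'benign', 'adware', 'adware', 'malware']\n",
"y_pred = ['benign', 'benign', 'adware', 'adware', 'benign', 'benign']\n",
"\n",
"# Per-class F1 is symmetric, but the weights come from the first argument,\n",
"# so swapping the arguments changes the weighted average.\n",
"print('weighted_f1(y_true, y_pred):', weighted_f1(y_true, y_pred))\n",
"print('weighted_f1(y_pred, y_true):', weighted_f1(y_pred, y_true))"
]
},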
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">7. Feature importance</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extracting the feature importances is straightforward: when a Random Forest is fitted, scikit-learn stores the importance of every input feature in the `feature_importances_` attribute of the fitted estimator (here, `clf_rnd`). These are impurity-based importances, normalized so that they sum to 1."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.03096656, 0.00303719, 0.00440737, 0.02318232, 0.01184895,\n",
" 0.01721388, 0.00881173, 0.02199267, 0.01122589, 0.01910279,\n",
" 0.01229994, 0.00912599, 0.0049411 , 0.01864105, 0.00468261,\n",
" 0.01359503, 0.0060695 , 0.01755146, 0.00504174, 0.01740915,\n",
" 0.00478204, 0.00668029, 0.00337915, 0.00937514, 0.00572423,\n",
" 0. , 0. , 0.00268121, 0.00471322, 0.02948284,\n",
" 0.0175912 , 0.02737585, 0.0276842 , 0.02610625, 0.0159516 ,\n",
" 0.0247063 , 0.01454405, 0.02000791, 0.03888253, 0.03004006,\n",
" 0.00794144, 0.03300505, 0.00432689, 0.0041829 , 0.01156361,\n",
" 0.00794625, 0. , 0. , 0. , 0.01207349,\n",
" 0.02251504, 0.01938611, 0.00347552, 0.00116829, 0.00072676,\n",
" 0.00094549, 0.00527031, 0.0106541 , 0.00290367, 0.00144508,\n",
" 0.00254706, 0.00234171, 0.00912673, 0.00249816, 0.00228634,\n",
" 0.00765582, 0.00907677, 0.01158292, 0.00196904, 0.0121832 ,\n",
" 0.00783499, 0.00994449, 0.00162465, 0.00188881, 0.14141056,\n",
" 0.03134543, 0.00328476, 0.00331256, 0.01770103])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Impurity-based importance of each feature, in the same order as the columns of X_train\n",
"clf_rnd.feature_importances_"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"{'duration': 0.0309665551436818,\n",
" 'total_fpackets': 0.0030371879478990325,\n",
" 'total_bpackets': 0.00440736580271348,\n",
" 'total_fpktl': 0.023182320203179382,\n",
" 'total_bpktl': 0.011848946866387284,\n",
" 'min_fpktl': 0.0172138800874452,\n",
" 'min_bpktl': 0.008811725005059473,\n",
" 'max_fpktl': 0.021992674160090632,\n",
" 'max_bpktl': 0.011225894335249498,\n",
" 'mean_fpktl': 0.019102793492948293,\n",
" 'mean_bpktl': 0.012299937432945812,\n",
" 'std_fpktl': 0.009125990714497939,\n",
" 'std_bpktl': 0.004941100960028792,\n",
" 'total_fiat': 0.01864105413523527,\n",
" 'total_biat': 0.0046826053903186015,\n",
" 'min_fiat': 0.013595030719278397,\n",
" 'min_biat': 0.006069501776756673,\n",
" 'max_fiat': 0.01755146299222828,\n",
" 'max_biat': 0.0050417449242102525,\n",
" 'mean_fiat': 0.017409145281749354,\n",
" 'mean_biat': 0.004782035041627898,\n",
" 'std_fiat': 0.006680286577580882,\n",
" 'std_biat': 0.0033791452474712593,\n",
" 'fpsh_cnt': 0.009375136418490434,\n",
" 'bpsh_cnt': 0.005724226852978681,\n",
" 'furg_cnt': 0.0,\n",
" 'burg_cnt': 0.0,\n",
" 'total_fhlen': 0.002681208069916213,\n",
" 'total_bhlen': 0.004713215348250107,\n",
" 'fPktsPerSecond': 0.02948284339931174,\n",
" 'bPktsPerSecond': 0.017591200018401452,\n",
" 'flowPktsPerSecond': 0.027375854967490988,\n",
" 'flowBytesPerSecond': 0.027684198541130613,\n",
" 'min_flowpktl': 0.026106246278593464,\n",
" 'max_flowpktl': 0.015951602374162346,\n",
" 'mean_flowpktl': 0.024706296600569305,\n",
" 'std_flowpktl': 0.014544047189019962,\n",
" 'min_flowiat': 0.020007906193409072,\n",
" 'max_flowiat': 0.03888253470752264,\n",
" 'mean_flowiat': 0.030040060846327637,\n",
" 'std_flowiat': 0.007941436909095132,\n",
" 'flow_fin': 0.03300505155808422,\n",
" 'flow_syn': 0.004326892974196047,\n",
" 'flow_rst': 0.004182896853531234,\n",
" 'flow_psh': 0.011563608853071505,\n",
" 'flow_ack': 0.00794624515429007,\n",
" 'flow_urg': 0.0,\n",
" 'flow_cwr': 0.0,\n",
" 'flow_ece': 0.0,\n",
" 'downUpRatio': 0.012073492815087055,\n",
" 'avgPacketSize': 0.022515041347932772,\n",
" 'fAvgSegmentSize': 0.0193861107583974,\n",
" 'fHeaderBytes': 0.0034755236861218407,\n",
" 'fAvgBytesPerBulk': 0.0011682860313492907,\n",
" 'fAvgPacketsPerBulk': 0.0007267613312591391,\n",
" 'fAvgBulkRate': 0.0009454920353145177,\n",
" 'bVarianceDataBytes': 0.00527031491907621,\n",
" 'bAvgSegmentSize': 0.010654097865027664,\n",
" 'bAvgBytesPerBulk': 0.002903666812276747,\n",
" 'bAvgPacketsPerBulk': 0.0014450753862712818,\n",
" 'bAvgBulkRate': 0.0025470611380064225,\n",
" 'sflow_fpacket': 0.002341713030336466,\n",
" 'sflow_fbytes': 0.009126733723286718,\n",
" 'sflow_bpacket': 0.0024981634482319683,\n",
" 'sflow_bbytes': 0.0022863428046196077,\n",
" 'min_active': 0.007655821990055255,\n",
" 'mean_active': 0.00907676939545392,\n",
" 'max_active': 0.011582923802634131,\n",
" 'std_active': 0.001969037502287548,\n",
" 'min_idle': 0.012183201793090091,\n",
" 'mean_idle': 0.007834992799200012,\n",
" 'max_idle': 0.009944488334906897,\n",
" 'std_idle': 0.0016246495009747393,\n",
" 'FFNEPD': 0.0018888070778040967,\n",
" 'Init_Win_bytes_forward': 0.14141056349348416,\n",
" 'Init_Win_bytes_backward': 0.03134542502586885,\n",
" 'RRT_samples_clnt': 0.003284757842849299,\n",
" 'Act_data_pkt_forward': 0.003312563514734042,\n",
" 'min_seg_size_forward': 0.017701026447635364}"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Find out which features matter most for classifying the data correctly.\n",
"# Build a dictionary mapping each feature name to its importance score.\n",
"feature_importances = {name: score for name, score in zip(X_train.columns, clf_rnd.feature_importances_)}\n",
"feature_importances"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Init_Win_bytes_forward 0.141411\n",
"max_flowiat 0.038883\n",
"flow_fin 0.033005\n",
"Init_Win_bytes_backward 0.031345\n",
"duration 0.030967\n",
"mean_flowiat 0.030040\n",
"fPktsPerSecond 0.029483\n",
"flowBytesPerSecond 0.027684\n",
"flowPktsPerSecond 0.027376\n",
"min_flowpktl 0.026106\n",
"mean_flowpktl 0.024706\n",
"total_fpktl 0.023182\n",
"avgPacketSize 0.022515\n",
"max_fpktl 0.021993\n",
"min_flowiat 0.020008\n",
"fAvgSegmentSize 0.019386\n",
"mean_fpktl 0.019103\n",
"total_fiat 0.018641\n",
"min_seg_size_forward 0.017701\n",
"bPktsPerSecond 0.017591\n",
"dtype: float64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Convert the dictionary to a pandas Series and sort it by importance, highest first\n",
"# Show the top 20\n",
"feature_importances_sorted = pd.Series(feature_importances).sort_values(ascending=False)\n",
"feature_importances_sorted.head(20)"
]
},
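{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `feature_importances_` sums to 1, the sum over any subset of features is that subset's share of the total importance. The cell below is a small sketch that adds up the ten highest scores, copied by hand from the sorted output above (rounded to six decimals); the dictionary `top10` and the variable `share` are names introduced here for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Ten highest importances, copied from the sorted Series shown above.\n",
"top10 = {\n",
"    'Init_Win_bytes_forward': 0.141411,\n",
"    'max_flowiat': 0.038883,\n",
"    'flow_fin': 0.033005,\n",
"    'Init_Win_bytes_backward': 0.031345,\n",
"    'duration': 0.030967,\n",
"    'mean_flowiat': 0.030040,\n",
"    'fPktsPerSecond': 0.029483,\n",
"    'flowBytesPerSecond': 0.027684,\n",
"    'flowPktsPerSecond': 0.027376,\n",
"    'min_flowpktl': 0.026106,\n",
"}\n",
"\n",
"# Share of the total (normalized) importance carried by these ten features.\n",
"share = sum(top10.values())\n",
"\n",
"# Print each score next to the running total.\n",
"cumulative = 0.0\n",
"for name, score in top10.items():\n",
"    cumulative += score\n",
"    print(f'{name:<26}{score:>10.6f}{cumulative:>10.6f}')\n",
"\n",
"print(f'Share of total importance in the top 10: {share:.1%}')"
]
},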
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style=\"color:blue\">8. Reducing the number of features</h2>"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Init_Win_bytes_forward',\n",
" 'max_flowiat',\n",
" 'flow_fin',\n",
" 'Init_Win_bytes_backward',\n",
" 'duration',\n",
" 'mean_flowiat',\n",
" 'fPktsPerSecond',\n",
" 'flowBytesPerSecond',\n",
" 'flowPktsPerSecond',\n",
" 'min_flowpktl']"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract the names of the 10 features the model considers most important\n",
"columns = list(feature_importances_sorted.head(10).index)\n",
"columns"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# Creamos los archivos de entrenamiento reducidos, es decir, para X_train y X_val con 79 características cada una, me \n",
"# quedo únicamente con las columnas (variables o características de entrada) definidas en \"columns\".\n",
"X_train_reduced = X_train[columns].copy()\n",
"X_val_reduced = X_val[columns].copy()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>max_flowiat</th>\n",
" <th>flow_fin</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>duration</th>\n",
" <th>mean_flowiat</th>\n",
" <th>fPktsPerSecond</th>\n",
" <th>flowBytesPerSecond</th>\n",
" <th>flowPktsPerSecond</th>\n",
" <th>min_flowpktl</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>508881</th>\n",
" <td>0</td>\n",
" <td>490</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>490</td>\n",
" <td>490.0</td>\n",
" <td>2040.816327</td>\n",
" <td>679591.836700</td>\n",
" <td>4081.632653</td>\n",
" <td>73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>208326</th>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>422</td>\n",
" </tr>\n",
" <tr>\n",
" <th>107213</th>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>436</td>\n",
" </tr>\n",
" <tr>\n",
" <th>466726</th>\n",
" <td>0</td>\n",
" <td>23933</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>23933</td>\n",
" <td>23933.0</td>\n",
" <td>41.783312</td>\n",
" <td>21267.705680</td>\n",
" <td>83.566623</td>\n",
" <td>54</td>\n",
" </tr>\n",
" <tr>\n",
" <th>230085</th>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>422</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>110268</th>\n",
" <td>0</td>\n",
" <td>5018131</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>8856187</td>\n",
" <td>4428093.5</td>\n",
" <td>0.225831</td>\n",
" <td>36.584593</td>\n",
" <td>0.338746</td>\n",
" <td>108</td>\n",
" </tr>\n",
" <tr>\n",
" <th>259178</th>\n",
" <td>88704</td>\n",
" <td>28238005</td>\n",
" <td>2</td>\n",
" <td>-1</td>\n",
" <td>28238005</td>\n",
" <td>28200000.0</td>\n",
" <td>0.070827</td>\n",
" <td>3.682980</td>\n",
" <td>0.070827</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>365838</th>\n",
" <td>4194240</td>\n",
" <td>34928</td>\n",
" <td>1</td>\n",
" <td>1718208</td>\n",
" <td>72542</td>\n",
" <td>14508.4</td>\n",
" <td>41.355353</td>\n",
" <td>5955.170798</td>\n",
" <td>82.710706</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>131932</th>\n",
" <td>13376</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>121958</th>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>420</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>379173 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" Init_Win_bytes_forward max_flowiat flow_fin \\\n",
"508881 0 490 0 \n",
"208326 0 -1 0 \n",
"107213 0 -1 0 \n",
"466726 0 23933 0 \n",
"230085 0 -1 0 \n",
"... ... ... ... \n",
"110268 0 5018131 0 \n",
"259178 88704 28238005 2 \n",
"365838 4194240 34928 1 \n",
"131932 13376 -1 0 \n",
"121958 0 -1 0 \n",
"\n",
" Init_Win_bytes_backward duration mean_flowiat fPktsPerSecond \\\n",
"508881 0 490 490.0 2040.816327 \n",
"208326 -1 0 0.0 0.000000 \n",
"107213 -1 0 0.0 0.000000 \n",
"466726 0 23933 23933.0 41.783312 \n",
"230085 -1 0 0.0 0.000000 \n",
"... ... ... ... ... \n",
"110268 0 8856187 4428093.5 0.225831 \n",
"259178 -1 28238005 28200000.0 0.070827 \n",
"365838 1718208 72542 14508.4 41.355353 \n",
"131932 -1 0 0.0 0.000000 \n",
"121958 -1 0 0.0 0.000000 \n",
"\n",
" flowBytesPerSecond flowPktsPerSecond min_flowpktl \n",
"508881 679591.836700 4081.632653 73 \n",
"208326 0.000000 0.000000 422 \n",
"107213 0.000000 0.000000 436 \n",
"466726 21267.705680 83.566623 54 \n",
"230085 0.000000 0.000000 422 \n",
"... ... ... ... \n",
"110268 36.584593 0.338746 108 \n",
"259178 3.682980 0.070827 52 \n",
"365838 5955.170798 82.710706 52 \n",
"131932 0.000000 0.000000 52 \n",
"121958 0.000000 0.000000 420 \n",
"\n",
"[379173 rows x 10 columns]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Training data restricted to the 10 selected features\n",
"X_train_reduced"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Init_Win_bytes_forward</th>\n",
" <th>max_flowiat</th>\n",
" <th>flow_fin</th>\n",
" <th>Init_Win_bytes_backward</th>\n",
" <th>duration</th>\n",
" <th>mean_flowiat</th>\n",
" <th>fPktsPerSecond</th>\n",
" <th>flowBytesPerSecond</th>\n",
" <th>flowPktsPerSecond</th>\n",
" <th>min_flowpktl</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>240832</th>\n",
" <td>90496</td>\n",
" <td>8580002</td>\n",
" <td>2</td>\n",
" <td>-1</td>\n",
" <td>8580002</td>\n",
" <td>8580002.000</td>\n",
" <td>0.233100</td>\n",
" <td>1.212121e+01</td>\n",
" <td>0.233100</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>326539</th>\n",
" <td>0</td>\n",
" <td>114583</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>114583</td>\n",
" <td>114583.000</td>\n",
" <td>8.727298</td>\n",
" <td>3.482192e+03</td>\n",
" <td>17.454596</td>\n",
" <td>67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200606</th>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>0.000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>422</td>\n",
" </tr>\n",
" <tr>\n",
" <th>431142</th>\n",
" <td>106816</td>\n",
" <td>7941127</td>\n",
" <td>1</td>\n",
" <td>-1</td>\n",
" <td>7941129</td>\n",
" <td>3970564.500</td>\n",
" <td>0.377780</td>\n",
" <td>2.354829e+01</td>\n",
" <td>0.377780</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>478100</th>\n",
" <td>4194240</td>\n",
" <td>31205763</td>\n",
" <td>1</td>\n",
" <td>1853440</td>\n",
" <td>31590262</td>\n",
" <td>1504298.190</td>\n",
" <td>0.379864</td>\n",
" <td>2.316220e+02</td>\n",
" <td>0.696417</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>215540</th>\n",
" <td>89792</td>\n",
" <td>7379378</td>\n",
" <td>2</td>\n",
" <td>-1</td>\n",
" <td>7379378</td>\n",
" <td>7379378.000</td>\n",
" <td>0.271026</td>\n",
" <td>2.249512e+01</td>\n",
" <td>0.271026</td>\n",
" <td>83</td>\n",
" </tr>\n",
" <tr>\n",
" <th>516620</th>\n",
" <td>62912</td>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>8</td>\n",
" <td>8.000</td>\n",
" <td>250000.000000</td>\n",
" <td>1.690000e+07</td>\n",
" <td>250000.000000</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>592495</th>\n",
" <td>262336</td>\n",
" <td>103128</td>\n",
" <td>0</td>\n",
" <td>32768</td>\n",
" <td>151998</td>\n",
" <td>30399.600</td>\n",
" <td>26.316136</td>\n",
" <td>1.193437e+04</td>\n",
" <td>39.474204</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>279808</th>\n",
" <td>4194240</td>\n",
" <td>60186541</td>\n",
" <td>1</td>\n",
" <td>1145472</td>\n",
" <td>60262041</td>\n",
" <td>6695782.333</td>\n",
" <td>0.082971</td>\n",
" <td>3.746969e+01</td>\n",
" <td>0.165942</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34456</th>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>0.000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000</td>\n",
" <td>408</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>126391 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" Init_Win_bytes_forward max_flowiat flow_fin \\\n",
"240832 90496 8580002 2 \n",
"326539 0 114583 0 \n",
"200606 0 -1 0 \n",
"431142 106816 7941127 1 \n",
"478100 4194240 31205763 1 \n",
"... ... ... ... \n",
"215540 89792 7379378 2 \n",
"516620 62912 8 0 \n",
"592495 262336 103128 0 \n",
"279808 4194240 60186541 1 \n",
"34456 0 -1 0 \n",
"\n",
" Init_Win_bytes_backward duration mean_flowiat fPktsPerSecond \\\n",
"240832 -1 8580002 8580002.000 0.233100 \n",
"326539 0 114583 114583.000 8.727298 \n",
"200606 -1 0 0.000 0.000000 \n",
"431142 -1 7941129 3970564.500 0.377780 \n",
"478100 1853440 31590262 1504298.190 0.379864 \n",
"... ... ... ... ... \n",
"215540 -1 7379378 7379378.000 0.271026 \n",
"516620 -1 8 8.000 250000.000000 \n",
"592495 32768 151998 30399.600 26.316136 \n",
"279808 1145472 60262041 6695782.333 0.082971 \n",
"34456 -1 0 0.000 0.000000 \n",
"\n",
" flowBytesPerSecond flowPktsPerSecond min_flowpktl \n",
"240832 1.212121e+01 0.233100 52 \n",
"326539 3.482192e+03 17.454596 67 \n",
"200606 0.000000e+00 0.000000 422 \n",
"431142 2.354829e+01 0.377780 52 \n",
"478100 2.316220e+02 0.696417 52 \n",
"... ... ... ... \n",
"215540 2.249512e+01 0.271026 83 \n",
"516620 1.690000e+07 250000.000000 52 \n",
"592495 1.193437e+04 39.474204 52 \n",
"279808 3.746969e+01 0.165942 52 \n",
"34456 0.000000e+00 0.000000 408 \n",
"\n",
"[126391 rows x 10 columns]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Validation data restricted to the 10 selected features\n",
"X_val_reduced"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retraining the model on the reduced training data"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n",
" criterion='gini', max_depth=None, max_features='auto',\n",
" max_leaf_nodes=None, max_samples=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,\n",
" oob_score=False, random_state=42, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# With only 10 input features, the forest should train faster\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)\n",
"clf_rnd.fit(X_train_reduced, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['asware', 'asware', 'benign', ..., 'benign', 'asware', 'benign'],\n",
" dtype=object)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Predict on the reduced validation set\n",
"y_pred = clf_rnd.predict(X_val_reduced)\n",
"y_pred"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 score: 0.926788599012114\n"
]
}
],
"source": [
"print(\"F1 score:\", f1_score(y_pred, y_val, average='weighted'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**As the previous cell shows, the model's weighted F1 score drops only slightly (from about 0.932 to about 0.927) after discarding 69 of its 79 features. Training and prediction on 10 columns should also be cheaper than on 79, although this notebook does not time either run.**"
]
},
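{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook does not time the two training runs. One way to check the speed difference is to wrap each `fit` call in `time.perf_counter()`. The cell below is a hedged sketch on synthetic data generated with `make_classification`, which stands in for the real dataset (an assumption); `fit_seconds` is a hypothetical helper, and the timings will vary by machine."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Synthetic stand-in for the notebook's data (assumption): 79 features, 10 informative.\n",
"X, y = make_classification(n_samples=2000, n_features=79, n_informative=10, random_state=42)\n",
"\n",
"def fit_seconds(features):\n",
"    # Hypothetical helper: train a 50-tree forest and return (model, elapsed seconds).\n",
"    model = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)\n",
"    start = time.perf_counter()\n",
"    model.fit(features, y)\n",
"    return model, time.perf_counter() - start\n",
"\n",
"full_model, full_time = fit_seconds(X)\n",
"\n",
"# Keep the 10 columns with the highest impurity-based importance.\n",
"top10_idx = np.argsort(full_model.feature_importances_)[::-1][:10]\n",
"X_reduced = X[:, top10_idx]\n",
"reduced_model, reduced_time = fit_seconds(X_reduced)\n",
"\n",
"print(f'79 features: {full_time:.3f} s')\n",
"print(f'10 features: {reduced_time:.3f} s')"
]
},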
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}